On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)

Mohsen Koohi Esfahani
Supervisors: Prof. Hans Vandierendonck and Dr. Peter Kilpatrick

Thesis on QUB Pure Portal
Thesis in PDF format

Graph algorithms find several usages in industry, science, humanities, and technology. The fast-growing size of graph datasets in the context of the processing model of the current hardware has resulted in different bottlenecks such as memory locality, work-efficiency, and load-balance that degrade the performance. To tackle these limitations, high-performance computing considers different aspects of the execution in order to design optimized algorithms through efficient usage of hardware resources.

The main idea in this thesis is to analyze the structure of graphs to exploit special features that are key to introduce new graph algorithms with optimized performance.

First, we study the structure of real-world graph datasets with skewed degree distribution and the applicability of graph relabeling algorithms as the main restructuring tools to improve performance and memory locality. To that end, we introduce novel locality metrics including Cache Miss Rate Degree Distribution, Effective Cache Size, Push Locality and Pull Locality, and Degree Range Decomposition.

Based on this structural analysis, we introduce the Uniform Memory Demands strategy that (i) recognizes diverse memory demands and behaviours as a source of performance inefficiency, (ii) separates contrasting memory demands into groups with uniform behaviours across each group, and (iii) designs bespoke data structures and algorithms for each group in order to satisfy memory demands with the lowest overhead.

We apply the Uniform Memory Demands strategy to design three graph algorithms with optimized performance: (i) the SAPCo Sort algorithm as a parallel counting sort algorithm that is faster than comparison-based sorting algorithms in degree-ordering of power-law graphs, (ii) the iHTL algorithm that optimizes locality in Sparse Matrix-Vector (SpMV) Multiplication graph algorithms by extracting dense subgraphs containing incoming edges to in-hubs and processing them in the push direction, and (iii) the LOTUS algorithm that optimizes locality in Triangle Counting by separating different caching demands and deploying specific data structure and algorithm for each of them.

Bibtex

@phdthesis{ODSAGA-ethos.874822,
  title  = {On Designing Structure-Aware High-Performance Graph Algorithms},
  author = {Mohsen Koohi Esfahani},
  year   = 2022,
  url    = {https://blogs.qub.ac.uk/DIPSA/On-Designing-Structure-Aware-High-Performance-Graph-Algorithms-PhD-Thesis/},
  school = {Queen's University Belfast},
  EThOSID = {uk.bl.ethos.874822}
}

Related Posts

LaganLighter

LaganLighter Source Code


Repository

https://github.com/DIPSA-QUB/LaganLighther

Algorithms in This Repo

Graph Types

LaganLighter supports the following graph formats:

  • CSR/CSC graph in text format, for testing. This format has 4 lines: (i) number of vertices (|V|), (ii) number of edges (|E|), (iii) |V| space-separated numbers showing offsets of the vertices, and (iv) |E| space-separated numbers indicating edges.
  • CSR WebGraph format (to be added)
  • CSR/CSC binary format (to be added)
  • CSR/CSC sbin (separated binary) format (to be added)

Measurements

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions. ( papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it with the total time (mt).

mt = - get_nano_time()
#pragma omp parallel  
{
   unsigned tid = omp_get_thread_num();
   ttimes[tid] = - get_nano_time();
	
   #pragma omp for nowait
   for(unsigned int v = 0; v < g->vertices_count; v++)
   {
      // .....
   }
   ttimes[tid] += get_nano_time();
}
mt += get_nano_time();
PTIP("Step ... ");

As an example, the following execution of Thrifty, shows that the “Zero Planting” step has been performed in 8.98 milliseconds and with a 8.22% load imbalance, while processors have been idle for 72.22% of the execution time, on average, in the “Initial Push” step.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

In order to assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially divide partitions and assign a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process. During the initialization of LaganLighter parallel processing environment (in initialize_omp_par_env() function defined in file omp.c), for each thread, we create a list of threads as consequent victims of stealing.

A thread, first, steals jobs (i.e., partitions) from consequent threads in the same NUMA node and then from the threads in consequent NUMA nodes. As an example, the following image shows the stealing order of a 24-core machine with 2 NUMA nodes. This shows that thread 1 steals from threads 2, 3, …,11, and ,0 running on the same NUMA socket and then from threads 13, 14, …, 23, and 12 running on the next NUMA socket.

We use dynamic_partitioning_...() functions (in file partitioning.c) to process partitions by threads in the specified order. A sample code is in the following:

struct dynamic_partitioning* dp = dynamic_partitioning_initialize(pe, partitions_count);

#pragma omp parallel  
{
   unsigned int tid = omp_get_thread_num();
   unsigned int partition = -1U;		

   while(1)
   {
      partition = dynamic_partitioning_get_next_partition(dp, tid, partition);
      if(partition == -1U)
	 break; 

      for(v = start_vertex[partition]; v < start_vertex[partition + 1]; v++)
      {
	// ....
       }
   }
}

dynamic_partitioning_reset(dp);

Bugs & Support

As “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you receive wrong results or you are suspicious about parts of the code, please contact us or submit an issue.

License

Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied. For details see terms of the License.

Copyright 2019-2022 The Queen’s University of Belfast, Northern Ireland, UK

LaganLighter

Related Posts

MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’22

36th ACM International Conference on Supercomputing 2022
June 27-30, 2022
Acceptance Rate: 25%

DOI: 10.1145/3524059.3532365
Authors’ Copy (PDF Format)

The Minimum Spanning Forest (MSF) problem finds usage in many different applications. While theoretical analysis shows that linear-time solutions exist, in practice, parallel MSF algorithms remain computationally demanding due to the continuously increasing size of data sets.

In this paper, we study the MSF algorithm from the perspective of graph structure and investigate the implications of the power-law degree distribution of real-world graphs
on this algorithm.

We introduce the MASTIFF algorithm as a structure-aware MSF algorithm that optimizes work efficiency by (1) dynamically tracking the largest forest component of each graph component and exempting them from processing, and (2) by avoiding topology-related operations such as relabeling and merging neighbour lists.

The evaluations on 2 different processor architectures with up to 128 cores and on graphs of up to 124 billion edges, shows that Mastiff is 3.4–5.9× faster than previous works.

Code Availability
The source-code of MASTIFF is available on LaganLighter Repository (alg3_mastiff.c and msf.c files). A sample execution of this source code for “Twitter-MPI” graph is shown in the following:

BibTex

@INPROCEEDINGS{10.1145/3524059.3532365,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
title = {{MASTIFF}: Structure-Aware Minimum Spanning Tree/Forest},
year = {2022},
isbn = {},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3524059.3532365},
doi = {10.1145/3524059.3532365},
booktitle = {Proceedings of the 36th ACM International Conference on Supercomputing},
numpages = {13}
}

LaganLighter

Related Posts

SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster)

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2022)
May 22-24, 2022

DOI: 10.1109/ISPASS55109.2022.00015

Authors’ Copy (PDF)

While counting sort has a better complexity than comparison-based sorting algorithms, its parallelization suffers from high performance overhead and/or has a memory complexity that depends on the numbers of threads and elements.

In this paper, we explore the optimization of parallel counting sort for degree-ordering of real-world graphs with power-law degree distribution and we introduce the Structure-Aware Parallel Counting (SAPCo) Sort algorithm that leverages the skewed degree distribution to accelerate sorting.

The evaluation for graphs of up to 3.6 billion vertices shows that SAPCo sort is, on average, 1.7-33.5 times faster than state-of-the-art sorting algorithms such as counting sort, radix sort, and sample sort.

For a detailed explanation of the algorithms, please refer to Chapter 5 of the On Designing Structure-Aware High-Performance Graph Algorithms thesis.

BibTex

@INPROCEEDINGS{10.1109/ISPASS55109.2022.00015,
  author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle={2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, 
  title={{SAPCo Sort}: Optimizing Degree-Ordering for Power-Law Graphs}, 
  year={2022},
  volume={},
  number={},
  pages={},
  publisher={IEEE Computer Society},
  doi={10.1109/ISPASS55109.2022.00015}
}

Code Availability
The source-code of SAPCo is available on LaganLighter Repository (alg1_sapco_sort.c and relabel.c files). A sample execution of this source code for “Twitter-MPI” graph is shown in the following:


LaganLighter

Related Posts

LOTUS: Locality Optimizing Triangle Counting – PPOPP’22

27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2022)
April 2-6, 2022
Acceptance Rate: 31%

DOI: 10.1145/3503221.3508402

Authors’ Copy (PDF Format)

Triangle Counting (TC) is a basic graph algorithm and is widely used in different fields of science, humanities and technology. The great size of real-world graphs with skewed degree distribution is a prohibiting factor in efficient TC.

In this paper, we study the implications of the power-law degree distribution of graph datasets on memory utilization in TC and we explain how a great percentage of triangles are formed around a small number of vertices with very great degrees (that are called hubs).

Using these novel observations, we present the LOTUS algorithm as a structure-aware and locality optimizing TC that separates counting triangles formed by hub and non-hub edges. Lotus introduces bespoke TC algorithms and data structures for different edges in order to (i) reduce the size of data structures that are target of random memory accesses, (ii) to optimize cache capacity utilization, (iii) to provide better load balance, and (iv) to avoid fruitless searches.

To provide better CPU utilization, we introduce the Squared Edge Tiling algorithm that divides a large task into smaller ones when the size of work for each edge in the neighbour-list depends on the index of that edge.

We evaluate the Lotus algorithm on 3 different processor architectures and for real-world graph datasets with up to 162 billion edges that shows LOTUS provides 2.2-5.5 times speedup in comparison to the previous works.

  • LOTUS: Locality Optimizing Triangle Counting
  • LOTUS: Locality Optimizing Triangle Counting
  • LOTUS: Locality Optimizing Triangle Counting - Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Lotus Graph Structure
  • LOTUS: Locality Optimizing Triangle Counting -
  • LOTUS: Locality Optimizing Triangle Counting - HHH & HHN
  • LOTUS: Locality Optimizing Triangle Counting - HNN
  • LOTUS: Locality Optimizing Triangle Counting - NNN
  • LOTUS: Locality Optimizing Triangle Counting - Evaluation
  • LOTUS: Locality Optimizing Triangle Counting - Conclusion
  • LOTUS: Locality Optimizing Triangle Counting -
  • LOTUS: Locality Optimizing Triangle Counting -

Code Availability
The source-code will be published soon.

BibTex

@INPROCEEDINGS{10.1145/3503221.3508402,
  author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle = {27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2022)},  
  title = {LOTUS: Locality Optimizing Triangle Counting}, 
  year = {2022},
  numpages = {15},
  pages={219–233},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3503221.3508402}
}

LaganLighter

Related Posts

Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’21

IEEE CLUSTER 2021

IEEE CLUSTER 2021
7-10 September

Acceptance rate: 29.4%

DOI: 10.1109/Cluster48925.2021.00042
IEEE Xplore
PDF Version (Authors’ Copy)

Thrifty introduces four optimization techniques to Label Propagation Connected Components:

1) Unified Labels Array accelerates label propagation by allowing the latest label of each vertex to be read in processing other vertices.

2) Zero Convergence optimizes work-efficiency in the pull iterations of Label Propagation by skipping converged vertices.

3) Zero Planting selects the best start propagating point and increases the convergence rate and removes pull iterations that are required for the lowest label to reach the core of graph.

4) Initial Push technique makes the first iteration work efficient by skipping processing edges of vertices that probability of convergence is very small.

Based on these optimizations, Thrifty provides 1.4✕ speedup to Afforest, 6.6✕ to Jayanti-Tarjan, 14.3✕ to BFS-CC, and 25.0✕ to Direction Optimizing Label Propagation.

  • Thrifty Label Propagation: Fast Connected Components For Skewed Degree Graphs
  • Thrifty Label Propagation: Outline
  • Thrifty Label Propagation: Background
  • Thrifty Label Propagation: Background
  • Thrifty Label Propagation: Background
  • Thrifty Label Propagation
  • Thrifty Label Propagation: Unified Labels Array
  • Thrifty Label Propagation: Unified Labels Array
  • Thrifty Label Propagation: Zero Convergence
  • Thrifty Label Propagation: Zero Convergence
  • Thrifty Label Propagation: Zero Planting
  • Thrifty Label Propagation: Zero Planting
  • Thrifty Label Propagation:  Initial Push
  • Thrifty Label Propagation:  Evaluation
  • Thrifty Label Propagation:  Evaluation
  • Thrifty Label Propagation:  Evaluation
  • Thrifty Label Propagation:  Conclusion
  • Thrifty Label Propagation:  Thanks
  • Thrifty Label Propagation:  A Gift from QUB


Code Availability
The source-code of Thrifty is available on LaganLighter Repository (alg2_thrifty.c and cc.c files). A sample execution of this source code for “Twitter-MPI” graph is shown in the following:


Bibtex

@INPROCEEDINGS{10.1109/Cluster48925.2021.00042,
  author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle={2021 IEEE International Conference on Cluster Computing (CLUSTER)}, 
  title={Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs}, 
  year={2021},
  volume={},
  number={},
  pages={226-237},
  publisher={IEEE Computer Society},
  doi={10.1109/Cluster48925.2021.00042}
}

Related Posts

Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’21

ICPP'21

50th International Conference on Parallel Processing (ICPP’21)
August 9-12, 2021

Acceptance Rate: 26.4%

DOI:10.1145/3472456.3472462
ACM Digital Library
PDF Version (Authors’ Copy)

This paper investigates the implications made by the structure of real-world graphs with power-law degree distribution on the locality of SpMV graph analytics, and by considering the efficacy of locality optimizing graph reordering algorithms (such as SlashBurn, GOrder, and Rabbit-Order) shows that irregular datasets requires special traversals in order to improve locality for hub vertices that dedicate a large portion of the processing time to themselves.

We introduce in-Hub Temporal Locality (iHTL) as a structure-aware and cache-friendly graph traversal that optimizes locality in pull traversal. iHTL identifies different blocks in the adjacency matrix of a graph and applies a suitable traversal direction (push or pull) for each block based on its contents. In other words, iHTL optimizes locality of one traversal of all edges of the graph by:

(1) applying push direction for flipped blocks containing edges to in-hubs, and
(2) applying pull direction for processing sparse block containing edges to non-hubs.

Moreover, iHTL introduces a new algorithm to efficiently identify the number of flipped blocks by investigating connection between hub vertices of the graph. This allows iHTL to create flipped blocks as much as the graph structure requires and makes iHTL suitable for a wide range of different real-world graph datasets like social networks and web graphs.

iHTL is 1.5× – 2.4× faster than pull and 4.8× – 9.5× faster than push in state-of-the-art graph processing frameworks. More importantly, iHTL is 1.3× – 1.5× faster than pull traversal of state-of-the-art locality optimizing reordering algorithms such as SlashBurn, GOrder, and Rabbit-Order while reduces the preprocessing time by 780×, on average.

  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : Outline
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : Introduction
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : Pull vs Push
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : Is Pull A Suitable Direction
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : iHTL: in-Hub Temporal Locality
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : iHTL Graph Structure
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : SpMV in iHTL
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing : Evaluation
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph : Conclusion
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph : Thanks
  • Exploiting in-Hub Temporal Locality in SpMV-based Graph : A Gift From QUB

Code Availability
The source-code will be published soon.

BibTex


@INPROCEEDINGS{10.1145/3472456.3472462,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
title = {Exploiting In-Hub Temporal Locality In SpMV-Based Graph Processing},
year = {2021},
isbn = {9781450390682},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3472456.3472462},
doi = {10.1145/3472456.3472462},
booktitle = {50th International Conference on Parallel Processing},
numpages = {10},
location = {Lemont, IL, USA},
series = {ICPP 2021}
}

LaganLighter

Related Posts

The GraphGrind Framework: Fast Graph Analytics on Large Shared-Memory Systems (PhD Thesis)

Thesis on QUB Pure Portal
Thesis in PDF Format

Author: Jiawen Sun, https://www.linkedin.com/in/jiawen-sun-33b368103/

As shared memory systems support terabyte-sized main memory, they provide an opportunity to perform efficient graph analytics on a single machine. Graph analytics is characterised by frequent synchronisation, which is addressed in part by shared memory systems. However, performance is limited by load imbalance and poor memory locality, which originate in the irregular structure of small-world graphs.
This dissertation demonstrates how graph partitioning can be used to optimise (i) load balance, (ii) Non-Uniform Memory Access (NUMA) locality and (iii) temporal locality of graph partitioning in shared memory systems. The developed techniques are implemented in GraphGrind, a new shared memory graph analytics framework.

At first, this dissertation shows that heuristic edge-balanced partitioning results in an imbalance in the number of vertices per partition. Thus, load imbalance exists between partitions, either for loops iterating over vertices, or for loops iterating over edges. To address this issue, this dissertation introduces a classification of algorithms to distinguish whether they algorithmically benefit from edge-balanced or vertex-balanced partitioning. This classification supports the adaptation of partitions to the characteristics of graph algorithms. Evaluation in GraphGrind, shows that this outperforms state-of-the-art graph analytics frameworks for shared memory including Ligra by 1.46x on average, and Polymer by 1.16x on average, using a variety of graph algorithms and datasets.

Secondly, this dissertation demonstrates that increasing the number of graph partitions is effective to improve temporal locality due to smaller working sets.
However, the increasing number of partitions results in vertex replication in some graph data structures. This dissertation resorts to using a graph layout that is immune to vertex replication and an automatic graph traversal algorithm that extends the previously established graph traversal heuristics to a 3-way graph layout choice is designed. This new algorithm furthermore depends upon the classification of graph algorithms introduced in the first part of the work. These techniques achieve an average speedup of 1.79x over Ligra and 1.42x over Polymer.

Finally, this dissertation presents a graph ordering algorithm to challenge the widely accepted heuristic to balance the number of edges per partition and minimise edge or vertex cut. This algorithm balances the number of edges per partition as well as the number of unique destinations of those edges. It balances edges and vertices for graphs with a power-law degree distribution. Moreover, this dissertation shows that the performance of graph ordering depends upon the characteristics of graph analytics frameworks, such as NUMA-awareness. This graph ordering algorithm achieves an average speedup of 1.87x over Ligra and 1.51x over Polymer.