On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)

Mohsen Koohi Esfahani
Supervisors: Prof. Hans Vandierendonck and Dr. Peter Kilpatrick

Thesis on QUB Pure Portal
Thesis in PDF format

Graph algorithms find several usages in industry, science, humanities, and technology. The fast-growing size of graph datasets in the context of the processing model of the current hardware has resulted in different bottlenecks such as memory locality, work-efficiency, and load-balance that degrade the performance. To tackle these limitations, high-performance computing considers different aspects of the execution in order to design optimized algorithms through efficient usage of hardware resources.

The main idea in this thesis is to analyze the structure of graphs to exploit special features that are key to introduce new graph algorithms with optimized performance.

First, we study the structure of real-world graph datasets with skewed degree distribution and the applicability of graph relabeling algorithms as the main restructuring tools to improve performance and memory locality. To that end, we introduce novel locality metrics including Cache Miss Rate Degree Distribution, Effective Cache Size, Push Locality and Pull Locality, and Degree Range Decomposition.

Based on this structural analysis, we introduce the Uniform Memory Demands strategy that (i) recognizes diverse memory demands and behaviours as a source of performance inefficiency, (ii) separates contrasting memory demands into groups with uniform behaviours across each group, and (iii) designs bespoke data structures and algorithms for each group in order to satisfy memory demands with the lowest overhead.

We apply the Uniform Memory Demands strategy to design three graph algorithms with optimized performance: (i) the SAPCo Sort algorithm as a parallel counting sort algorithm that is faster than comparison-based sorting algorithms in degree-ordering of power-law graphs, (ii) the iHTL algorithm that optimizes locality in Sparse Matrix-Vector (SpMV) Multiplication graph algorithms by extracting dense subgraphs containing incoming edges to in-hubs and processing them in the push direction, and (iii) the LOTUS algorithm that optimizes locality in Triangle Counting by separating different caching demands and deploying specific data structure and algorithm for each of them.


  title  = {On Designing Structure-Aware High-Performance Graph Algorithms},
  author = {Mohsen Koohi Esfahani},
  year   = 2022,
  url    = {https://blogs.qub.ac.uk/DIPSA/On-Designing-Structure-Aware-High-Performance-Graph-Algorithms-PhD-Thesis/},
  school = {PhD thesis, Queen's University Belfast}

Related Posts


The GraphGrind Framework: Fast Graph Analytics on Large Shared-Memory Systems (PhD Thesis)

Thesis on QUB Pure Portal
Thesis in PDF Format

Author: Jiawen Sun, https://www.linkedin.com/in/jiawen-sun-33b368103/

As shared memory systems support terabyte-sized main memory, they provide an opportunity to perform efficient graph analytics on a single machine. Graph analytics is characterised by frequent synchronisation, which is addressed in part by shared memory systems. However, performance is limited by load imbalance and poor memory locality, which originate in the irregular structure of small-world graphs.
This dissertation demonstrates how graph partitioning can be used to optimise (i) load balance, (ii) Non-Uniform Memory Access (NUMA) locality and (iii) temporal locality of graph partitioning in shared memory systems. The developed techniques are implemented in GraphGrind, a new shared memory graph analytics framework.

At first, this dissertation shows that heuristic edge-balanced partitioning results in an imbalance in the number of vertices per partition. Thus, load imbalance exists between partitions, either for loops iterating over vertices, or for loops iterating over edges. To address this issue, this dissertation introduces a classification of algorithms to distinguish whether they algorithmically benefit from edge-balanced or vertex-balanced partitioning. This classification supports the adaptation of partitions to the characteristics of graph algorithms. Evaluation in GraphGrind, shows that this outperforms state-of-the-art graph analytics frameworks for shared memory including Ligra by 1.46x on average, and Polymer by 1.16x on average, using a variety of graph algorithms and datasets.

Secondly, this dissertation demonstrates that increasing the number of graph partitions is effective to improve temporal locality due to smaller working sets.
However, the increasing number of partitions results in vertex replication in some graph data structures. This dissertation resorts to using a graph layout that is immune to vertex replication and an automatic graph traversal algorithm that extends the previously established graph traversal heuristics to a 3-way graph layout choice is designed. This new algorithm furthermore depends upon the classification of graph algorithms introduced in the first part of the work. These techniques achieve an average speedup of 1.79x over Ligra and 1.42x over Polymer.

Finally, this dissertation presents a graph ordering algorithm to challenge the widely accepted heuristic to balance the number of edges per partition and minimise edge or vertex cut. This algorithm balances the number of edges per partition as well as the number of unique destinations of those edges. It balances edges and vertices for graphs with a power-law degree distribution. Moreover, this dissertation shows that the performance of graph ordering depends upon the characteristics of graph analytics frameworks, such as NUMA-awareness. This graph ordering algorithm achieves an average speedup of 1.87x over Ligra and 1.51x over Polymer.