Comprehensive evaluation is one of the foundations of experimental science. In high-performance graph processing, a thorough evaluation of contributions becomes more achievable when different frameworks support common input formats. In practice, however, each framework defines its own format, which may not support reading large-scale real-world graph datasets. This creates a demand for high-performance graph-loading libraries that (i) accelerate the design of new graph algorithms, (ii) enable evaluation of contributions on a wide range of graph algorithms, and (iii) facilitate easy and fast comparison across different graph frameworks.

To that end, we present ParaGrapher, a high-performance API and library for loading large-scale and compressed graphs. ParaGrapher supports different types of requests for accessing graphs in shared-memory, distributed-memory, and out-of-core graph processing. We explain the design of ParaGrapher and present a performance model of graph decompression, which we use to evaluate ParaGrapher on three storage types.

Our evaluation shows that by decompressing compressed graphs in WebGraph format, ParaGrapher delivers up to 3.2 times speedup in loading and up to 5.2 times speedup in end-to-end execution in comparison to the binary and textual formats.

@misc{paragrapher-arxiv,
  title = {Selective Parallel Loading of Large-Scale Compressed Graphs with {ParaGrapher}},
  author = {Koohi Esfahani, Mohsen and D'Antonio, Marco and Tauhidi, Syed Ibtisam and Mai, Thai Son and Vandierendonck, Hans},
  year = {2024},
  eprint = {2404.19735},
  archivePrefix = {arXiv},
  primaryClass = {cs.AR},
  doi = {10.48550/arXiv.2404.19735}
}

The OS cache of storage contents was dropped after each evaluation (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'). The flushcache.c file (https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/flushcache.c) provides the same functionality for users without sudo access; however, it usually takes longer to complete.
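The idea is, in effect, to push cached file pages out of memory by dirtying a buffer comparable in size to physical memory; a minimal sketch of that approach (an illustration only, not the repository's flushcache.c) is:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	// Allocate and write ~90% of physical memory so the kernel reclaims
	// page-cache pages; slower than drop_caches but needs no root access.
	long pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGE_SIZE);
	size_t size = (size_t)(0.9 * pages) * (size_t)page_size;
	char* buf = malloc(size);
	if(buf == NULL)
		return 1;
	memset(buf, 1, size);
	free(buf);
	return 0;
}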

For LustreFS, we repeated the evaluation of read and pread with the O_DIRECT flag, as this flag prevents client-side caching.
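For reference, a minimal O_DIRECT read looks like the following sketch (the file name is illustrative; O_DIRECT requires the buffer, offset, and length to be aligned to the logical block size, 4 KiB here):

#define _GNU_SOURCE // for O_DIRECT on Linux
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int fd = open("graph.bin", O_RDONLY | O_DIRECT);
	if(fd < 0)
		return 1;

	// The buffer must be block-aligned for O_DIRECT.
	void* buf;
	if(posix_memalign(&buf, 4096, 4096) != 0)
		return 1;

	ssize_t ret = pread(fd, buf, 4096, 0); // bypasses the client-side cache
	(void)ret;

	free(buf);
	close(fd);
	return 0;
}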

For the HDD and SSD experiments, we used a machine with an Intel Xeon W-2295 CPU (3.00 GHz, 18 cores, 36 hyper-threads, 24 MB L3 cache) and 256 GB DDR4-2933 memory, running Debian 12 with Linux 6.1. For LustreFS, we used a machine with 2 TB DDR4-3200 memory and two AMD EPYC 7702 CPUs, with 128 cores and 256 threads in total.

The results of the evaluation using read_bandwidth.c are in the following table; the values are bandwidth in MB/s. The one or two small digits next to each number, on a white background, are the percentage of load imbalance between parallel threads.

The ParaGrapher source code for accessing WebGraphs has been published. The supported graph types are:

PARAGRAPHER_CSX_WG_400_AP: graphs compressed in WebGraph format with 4-byte vertex IDs. Graphs in this category: LAW web graphs (https://law.di.unimi.it/datasets.php).

PARAGRAPHER_CSX_WG_404_AP: graphs compressed in WebGraph format with 4-byte vertex IDs and 4-byte integer edge weights. Graphs in this category: MS-BioGraphs (https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs/).

ParaGrapher implements these graph types through its asynchronous and parallel API. The user implements a callback function that the API calls upon completion of reading each block of edges. ParaGrapher uses shared memory for the interaction between its C library and the Java library that deploys the WebGraph framework.
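As an illustration of the callback pattern (the function shape below is ours; the exact ParaGrapher types and signatures are defined in the repository headers):

// Illustrative callback: invoked once per completed block of edges.
// The argument types here are assumptions for the sketch, not the real API.
void on_edge_block_read(unsigned int* block_edges, unsigned long edges_count, void* user_arg)
{
	unsigned long* total_edges = (unsigned long*)user_arg;

	#pragma omp atomic
	*total_edges += edges_count; // e.g., count edges as blocks arrive

	// ... or build CSR rows, stream the block to an algorithm, etc.
}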

For further details, please refer to the ParaGrapher source code repository (https://github.com/DIPSA-QUB/ParaGrapher), particularly src/webgraph.c and the src/WG*.java files.

Graph algorithms have many applications in industry, science, humanities, and technology. The fast-growing size of graph datasets, relative to the processing model of current hardware, results in bottlenecks such as poor memory locality, work-inefficiency, and load imbalance that degrade performance. To tackle these limitations, high-performance computing considers different aspects of execution in order to design optimized algorithms that use hardware resources efficiently.

The main idea in this thesis is to analyze the structure of graphs to exploit special features that are key to introducing new graph algorithms with optimized performance.

First, we study the structure of real-world graph datasets with skewed degree distribution and the applicability of graph relabeling algorithms as the main restructuring tools to improve performance and memory locality. To that end, we introduce novel locality metrics including Cache Miss Rate Degree Distribution, Effective Cache Size, Push Locality and Pull Locality, and Degree Range Decomposition.

Based on this structural analysis, we introduce the Uniform Memory Demands strategy that (i) recognizes diverse memory demands and behaviours as a source of performance inefficiency, (ii) separates contrasting memory demands into groups with uniform behaviours across each group, and (iii) designs bespoke data structures and algorithms for each group in order to satisfy memory demands with the lowest overhead.

We apply the Uniform Memory Demands strategy to design three graph algorithms with optimized performance: (i) the SAPCo Sort algorithm as a parallel counting sort algorithm that is faster than comparison-based sorting algorithms in degree-ordering of power-law graphs, (ii) the iHTL algorithm that optimizes locality in Sparse Matrix-Vector (SpMV) Multiplication graph algorithms by extracting dense subgraphs containing incoming edges to in-hubs and processing them in the push direction, and (iii) the LOTUS algorithm that optimizes locality in Triangle Counting by separating different caching demands and deploying specific data structure and algorithm for each of them.

BibTeX

@phdthesis{ODSAGA-ethos.874822,
  title = {On Designing Structure-Aware High-Performance Graph Algorithms},
  author = {Koohi Esfahani, Mohsen},
  year = {2022},
  url = {https://blogs.qub.ac.uk/DIPSA/On-Designing-Structure-Aware-High-Performance-Graph-Algorithms-PhD-Thesis/},
  school = {Queen's University Belfast},
  EThOSID = {uk.bl.ethos.874822}
}

LaganLighter supports the following graph formats:

CSR/CSC graph in text format, for testing. This format has 4 lines: (i) the number of vertices (|V|), (ii) the number of edges (|E|), (iii) |V| space-separated numbers giving the offset of each vertex into the edge array, and (iv) |E| space-separated numbers indicating the edges, as in the toy example below.
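For example, a toy graph with 4 vertices and 5 edges (vertex 0 pointing to 1 and 2, vertex 1 to 3, vertex 2 to 0 and 2, and vertex 3 with no edges) would be stored as the following, assuming each offset marks where a vertex's edges begin in the edge array:

4
5
0 2 3 5
1 2 3 0 2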

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions (the papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).
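These wrappers build on PAPI's event-set interface; a standalone sketch of that underlying pattern is below (not the omp.c code itself; compile with -lpapi):

#include <papi.h>
#include <stdio.h>

int main(void)
{
	// Initialize PAPI and create an event set with two counters.
	if(PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
		return 1;
	int evset = PAPI_NULL;
	PAPI_create_eventset(&evset);
	PAPI_add_event(evset, PAPI_L3_TCM);  // L3 total cache misses
	PAPI_add_event(evset, PAPI_TOT_INS); // completed instructions

	long long counts[2];
	PAPI_start(evset);
	// ... region of interest ...
	PAPI_stop(evset, counts);

	printf("L3 misses: %lld, instructions: %lld\n", counts[0], counts[1]);
	return 0;
}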

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, the PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it together with the total time (mt).

mt = - get_nano_time();              // total loop time
#pragma omp parallel
{
	unsigned tid = omp_get_thread_num();
	ttimes[tid] = - get_nano_time(); // per-thread loop time
	#pragma omp for nowait
	for(unsigned int v = 0; v < g->vertices_count; v++)
	{
		// .....
	}
	ttimes[tid] += get_nano_time();
}
mt += get_nano_time();
PTIP("Step ... ");                   // prints mt and the average idle percentage
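Assuming this structure, the average idle percentage reported by PTIP can be derived from mt and ttimes as in the following sketch (our reading of the macro's intent, not its exact body; threads_count is an assumed variable):

// A thread's idle time in the loop is mt - ttimes[tid], so the average
// idle time over threads_count threads, as a percentage of mt, is:
double sum = 0;
for(unsigned t = 0; t < threads_count; t++)
	sum += ttimes[t];
double idle_percent = 100.0 * (1.0 - sum / ((double)threads_count * mt));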

As an example, the following execution of Thrifty shows that the “Zero Planting” step has been performed in 8.98 milliseconds with an 8.22% load imbalance, while in the “Initial Push” step, processors have been idle for 72.22% of the execution time, on average.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

In order to assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially assign a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process: during the initialization of the LaganLighter parallel processing environment (in the initialize_omp_par_env() function defined in omp.c), we create, for each thread, a list of threads as its consecutive victims of stealing.

A thread first steals jobs (i.e., partitions) from the threads in the same NUMA node and then from the threads in the following NUMA nodes. As an example, the following image shows the stealing order on a 24-core machine with 2 NUMA nodes: thread 1 steals from threads 2, 3, …, 11, and 0, running on the same NUMA socket, and then from threads 13, 14, …, 23, and 12, running on the next NUMA socket.
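For illustration, the victim lists in that example can be generated as in the following sketch (the function name and the layout of thread IDs over NUMA nodes are assumptions; the actual construction is in initialize_omp_par_env() in omp.c):

// Builds the victim order of thread `tid` for `nodes` NUMA nodes with
// `tpn` threads per node: first the other threads of its own node, in
// circular order, then the threads of the following nodes.
// For tid = 1, nodes = 2, tpn = 12: 2, 3, ..., 11, 0, 13, 14, ..., 23, 12.
void fill_victim_list(unsigned tid, unsigned nodes, unsigned tpn, unsigned* victims)
{
	unsigned idx = 0;
	unsigned node = tid / tpn;
	unsigned local = tid % tpn;
	for(unsigned n = 0; n < nodes; n++)
	{
		unsigned base = ((node + n) % nodes) * tpn;
		for(unsigned i = (n == 0) ? 1 : 0; i < tpn; i++) // skip tid itself
			victims[idx++] = base + (local + i) % tpn;
	}
}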

We use the dynamic_partitioning_...() functions (in partitioning.c) so that threads process partitions in the specified order. A sample of this pattern is in the following.
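The sketch below shows the intended usage; the dynamic_partitioning_...() names mirror partitioning.c, but the exact signatures here are assumptions, so consult the file for the real interface:

// Hypothetical sketch: each thread processes its own partitions first,
// then steals from its victims; -1U signals that no work remains.
struct dynamic_partitioning* dp = dynamic_partitioning_initialize(pe, partitions_count);

#pragma omp parallel
{
	unsigned tid = omp_get_thread_num();
	unsigned partition = -1U;
	while(1)
	{
		partition = dynamic_partitioning_get_next_partition(dp, tid, partition);
		if(partition == -1U)
			break;
		for(unsigned v = partition_start[partition]; v < partition_start[partition + 1]; v++)
		{
			// ... process vertex v ...
		}
	}
}

dynamic_partitioning_reset(dp);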

Since “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you get wrong results, or you are suspicious about parts of the code, please contact us or submit an issue.

License

Licensed under the GNU General Public License, version 3, as published by the Free Software Foundation. You may not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty, either express or implied, including, without limitation, the implied warranties of merchantability or fitness for a particular purpose. For details, see the terms of the License.

Copyright 2022 The Queen’s University of Belfast, Northern Ireland, UK

The Minimum Spanning Forest (MSF) problem finds usage in many different applications. While theoretical analysis shows that linear-time solutions exist, in practice, parallel MSF algorithms remain computationally demanding due to the continuously increasing size of datasets.

In this paper, we study the MSF algorithm from the perspective of graph structure and investigate the implications of the power-law degree distribution of real-world graphs on this algorithm.

We introduce the MASTIFF algorithm as a structure-aware MSF algorithm that optimizes work efficiency by (1) dynamically tracking the largest forest component of each graph component and exempting it from processing, and (2) avoiding topology-related operations such as relabeling and merging neighbour lists.

The evaluation on 2 different processor architectures with up to 128 cores, and on graphs of up to 124 billion edges, shows that MASTIFF is 3.4–5.9× faster than previous works.

Code Availability: The source code of MASTIFF is available in the LaganLighter repository (files alg3_mastiff.c and msf.c). A sample execution of this source code for the “Twitter-MPI” graph is shown in the following:

BibTeX

@INPROCEEDINGS{10.1145/3524059.3532365,
  author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  title = {{MASTIFF}: Structure-Aware Minimum Spanning Tree/Forest},
  year = {2022},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3524059.3532365},
  doi = {10.1145/3524059.3532365},
  booktitle = {Proceedings of the 36th ACM International Conference on Supercomputing},
  numpages = {13}
}

While counting sort has better time complexity than comparison-based sorting algorithms, its parallelization suffers from high performance overhead and/or has a memory complexity that depends on the number of threads and the number of elements.

In this paper, we explore the optimization of parallel counting sort for degree-ordering of real-world graphs with skewed degree distribution and we introduce the Structure-Aware Parallel Counting (SAPCo) Sort algorithm that leverages the skewed degree distribution to accelerate sorting.

The evaluation for graphs of up to 3.6 billion vertices shows that SAPCo Sort is, on average, 1.7-33.5 times faster than state-of-the-art sorting algorithms such as counting sort, radix sort, and sample sort.

@INPROCEEDINGS{10.1109/ISPASS55109.2022.00015,
  author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle={2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  title={{SAPCo Sort}: Optimizing Degree-Ordering for Power-Law Graphs},
  year={2022},
  publisher={IEEE Computer Society},
  doi={10.1109/ISPASS55109.2022.00015}
}

Code Availability: The source code of SAPCo is available in the LaganLighter repository (files alg1_sapco_sort.c and relabel.c). A sample execution of this source code for the “Twitter-MPI” graph is shown in the following:

Triangle Counting (TC) is a basic graph algorithm and is widely used in different fields of science, humanities, and technology. The large size of real-world graphs with skewed degree distribution is a prohibitive factor in efficient TC.

In this paper, we study the implications of the power-law degree distribution of graph datasets for memory utilization in TC, and we explain how a large percentage of triangles are formed around a small number of vertices with very high degrees (called hubs).

Using these novel observations, we present the LOTUS algorithm, a structure-aware and locality-optimizing TC that separately counts triangles formed by hub and non-hub edges. LOTUS introduces bespoke TC algorithms and data structures for the different edge types in order to (i) reduce the size of the data structures that are the targets of random memory accesses, (ii) optimize cache capacity utilization, (iii) provide better load balance, and (iv) avoid fruitless searches.

To provide better CPU utilization, we introduce the Squared Edge Tiling algorithm that divides a large task into smaller ones when the amount of work per edge in the neighbour-list depends on the index of that edge.

We evaluate the LOTUS algorithm on 3 different processor architectures and on real-world graph datasets with up to 162 billion edges, showing that LOTUS provides a 2.2-5.5 times speedup in comparison to previous works.

Code Availability: The source code will be published soon.

BibTeX

@INPROCEEDINGS{10.1145/3503221.3508402,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
booktitle = {27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2022)},
title = {LOTUS: Locality Optimizing Triangle Counting},
year = {2022},
numpages = {15},
pages={219–233},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3503221.3508402}
}

Graph reordering algorithms try to improve the locality of graph algorithms by assigning new IDs to vertices, which ultimately changes the order of random memory accesses. While graph relabeling algorithms such as SlashBurn, GOrder, and Rabbit-Order provide better locality, it is not clear how they affect graph processing across different graph datasets, mainly for three reasons: (1) the large size of datasets, (2) the lack of suitable measurement tools, and (3) the disparate characteristics of graphs. This paucity of analysis has also inhibited the design of more efficient reordering algorithms (RAs).

This paper introduces a number of metrics and tools to investigate the functionality of graph reordering algorithms and their effects on different real-world graph datasets: (1) We introduce the Cache Miss Rate Degree Distribution and the Degree Distribution of Neighbour-to-Neighbour Average ID Distance (N2N AID) to show how reordering algorithms affect different vertices. (2) We introduce the Effective Cache Size as a metric of how much cache capacity reordered graphs use to satisfy random memory accesses. (3) We introduce the Asymmetricity Degree Distribution and Neighbourhood Decomposition to explain the composition of vertex neighbourhoods and the structural differences between web graphs and social networks. (4) We investigate the effects of the structure of real-world graphs on the locality and performance of traversing graphs in pull and push directions by introducing Push Locality and Pull Locality.

Finally, we present improvements to graph reordering algorithms and propose other suggestions based on the new insights and features of real-world graphs introduced by this paper.

BibTeX

@INPROCEEDINGS{10.1109/IISWC53511.2021.00020,
  author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle={2021 IEEE International Symposium on Workload Characterization (IISWC'21)},
  title={Locality Analysis of Graph Reordering Algorithms},
  year={2021},
  pages={101-112},
  publisher={IEEE Computer Society},
  doi={10.1109/IISWC53511.2021.00020}
}