LaganLighter supports the following graph formats:

CSR/CSC graph in text format, for testing. This format has 4 lines: (i) number of vertices (|V|), (ii) number of edges (|E|), (iii) |V| space-separated numbers showing offsets of the vertices, and (iv) |E| space-separated numbers indicating edges.

CSR/CSC sbin (separated binary) format (to be added)

Measurements

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions. ( papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it with the total time (mt).

mt = - get_nano_time()
#pragma omp parallel
{
unsigned tid = omp_get_thread_num();
ttimes[tid] = - get_nano_time();
#pragma omp for nowait
for(unsigned int v = 0; v < g->vertices_count; v++)
{
// .....
}
ttimes[tid] += get_nano_time();
}
mt += get_nano_time();
PTIP("Step ... ");

As an example, the following execution of Thrifty, shows that the “Zero Planting” step has been performed in 8.98 milliseconds and with a 8.22% load imbalance, while processors have been idle for 72.22% of the execution time, on average, in the “Initial Push” step.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

In order to assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially divide partitions and assign a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process. During the initialization of LaganLighter parallel processing environment (in initialize_omp_par_env() function defined in file omp.c), for each thread, we create a list of threads as consequent victims of stealing.

A thread, first, steals jobs (i.e., partitions) from consequent threads in the same NUMA node and then from the threads in consequent NUMA nodes. As an example, the following image shows the stealing order of a 24-core machine with 2 NUMA nodes. This shows that thread 1 steals from threads 2, 3, …,11, and ,0 running on the same NUMA socket and then from threads 13, 14, …, 23, and 12 running on the next NUMA socket.

We use dynamic_partitioning_...() functions (in file partitioning.c) to process partitions by threads in the specified order. A sample code is in the following:

As “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you receive wrong results or you are suspicious about parts of the code, please contact us or submit an issue.

License

Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied. For details see terms of the License.

Copyright 2019-2022 The Queen’s University of Belfast, Northern Ireland, UK

Triangle Counting (TC) is a basic graph algorithm and is widely used in different fields of science, humanities and technology. The great size of real-world graphs with skewed degree distribution is a prohibiting factor in efficient TC.

In this paper, we study the implications of the power-law degree distribution of graph datasets on memory utilization in TC and we explain how a great percentage of triangles are formed around a small number of vertices with very great degrees (that are called hubs).

Using these novel observations, we present the LOTUS algorithm as a structure-aware and locality optimizing TC that separates counting triangles formed by hub and non-hub edges. Lotus introduces bespoke TC algorithms and data structures for different edges in order to (i) reduce the size of data structures that are target of random memory accesses, (ii) to optimize cache capacity utilization, (iii) to provide better load balance, and (iv) to avoid fruitless searches.

To provide better CPU utilization, we introduce the Squared Edge Tiling algorithm that divides a large task into smaller ones when the size of work for each edge in the neighbour-list depends on the index of that edge.

We evaluate the Lotus algorithm on 3 different processor architectures and for real-world graph datasets with up to 162 billion edges that shows LOTUS provides 2.2-5.5 times speedup in comparison to the previous works.

Code Availability The source-code will be published soon.

BibTex

@INPROCEEDINGS{10.1145/3503221.3508402,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
booktitle = {27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2022)},
title = {LOTUS: Locality Optimizing Triangle Counting},
year = {2022},
numpages = {15},
pages={219–233},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3503221.3508402}
}

Graph reordering algorithms try to improve locality of graph algorithms by assigning new IDs to vertices that ultimately changes the order of random memory accesses. While graph relabeling algorithms such as SlashBurn, GOrder, and Rabbit-Order provide better locality, it is not clear how they affect graph processing and different graph datasets , mainly, for three reasons: (1) The large size of datasets, (2) The lack of suitable measurement tools, and (3) Disparate characteristics of graphs. The paucity of analysis has also inhibited the design of more efficient RAs.

This paper introduces a number of metrics and tools to investigate the functionality of graph reordering algorithms and their effects on different real-world graph datasets: (1) We introduce the Cache Miss Rate Degree Distribution and Degree Distribution of Neighbour to Neighbour Average Distance ID (N2N AID) to show how reordering algorithms affect different vertices, (2) We introduce the Effective Cache Size as a metric to measure how much of cache capacity is used by reordered graphs for satisfying random memory accesses, (3) We introduce the Assymetricity Degree Distribution and Neighbourhood Decomposition to explain the composition of neighbourhood of vertices to explain structural differences between web graphs and social networks. (4) We investigate the effects of the structure of real-world graphs on the locality and performance of traversing graphs in pull and push directions by introducing Push Locality and Pull Locality.

Finally, we present improvements to graph reordering algorithms and propose other suggestions based on the new insights and features of real-world graphs introduced by this paper.

BibTex

@INPROCEEDINGS{10.1109/IISWC53511.2021.00020,
author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
booktitle={2021 IEEE International Symposium on Workload Characterization (IISWC'21)},
title={Locality Analysis of Graph Reordering Algorithms},
year={2021},
volume={},
number={},
pages={101-112},
publisher={IEEE Computer Society},
doi={10.1109/IISWC53511.2021.00020}
}

This paper investigates the implications made by the structure of real-world graphs with power-law degree distribution on the locality of SpMV graph analytics, and by considering the efficacy of locality optimizing graph reordering algorithms (such as SlashBurn, GOrder, and Rabbit-Order) shows that irregular datasets requires special traversals in order to improve locality for hub vertices that dedicate a large portion of the processing time to themselves.

We introduce in-Hub Temporal Locality (iHTL) as a structure-aware and cache-friendly graph traversal that optimizes locality in pull traversal. iHTL identifies different blocks in the adjacency matrix of a graph and applies a suitable traversal direction (push or pull) for each block based on its contents. In other words, iHTL optimizes locality of one traversal of all edges of the graph by:

(1) applying push direction for flipped blocks containing edges to in-hubs, and (2) applying pull direction for processing sparse block containing edges to non-hubs.

Moreover, iHTL introduces a new algorithm to efficiently identify the number of flipped blocks by investigating connection between hub vertices of the graph. This allows iHTL to create flipped blocks as much as the graph structure requires and makes iHTL suitable for a wide range of different real-world graph datasets like social networks and web graphs.

iHTL is 1.5× – 2.4× faster than pull and 4.8× – 9.5× faster than push in state-of-the-art graph processing frameworks. More importantly, iHTL is 1.3× – 1.5× faster than pull traversal of state-of-the-art locality optimizing reordering algorithms such as SlashBurn, GOrder, and Rabbit-Order while reduces the preprocessing time by 780×, on average.

Code Availability The source-code will be published soon.

BibTex

@INPROCEEDINGS{10.1145/3472456.3472462,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
title = {Exploiting In-Hub Temporal Locality In SpMV-Based Graph Processing},
year = {2021},
isbn = {9781450390682},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3472456.3472462},
doi = {10.1145/3472456.3472462},
booktitle = {50th International Conference on Parallel Processing},
numpages = {10},
location = {Lemont, IL, USA},
series = {ICPP 2021}
}

Relabeling (reordering) algorithms aim to improve the poor memory locality of graph processing by changing the order of vertices. This paper analyses the functionality of three state-of-the-art relabeling algorithms: SlashBurn, GOrder, and Rabbit-Order for real-world graphs.

We use a number of techniques to explain how locality is affected by relabeling algorithms and how locality of different datasets (like social networks and web graphs) is enhanced by relabeling algorithms.

We use last level cache simulation to study miss rate degree distribution. We also use the degree distribution of Giant Connected Component (GCC) in SlashBurn iterations to see if real-world graphs follow the assumption that “power-law graphs are created/destroyed recursively” [SlashBurn]. We represent SlashBurn++ as an enhanced version of SlashBurn with lower preprocessing time and better locality.

Using cache simulation, we count the number of misses for accessing vertices data of high-degree vertices. This helps to explain how GOrder provides better temporal locality by managing cache space. Average ID Distance (AID) is a spatial locality metric introduced in this paper to explain how clustering relabeling algorithms like Rabbit-Order provide better spatial locality.

This paper also investigates why push and pull traversals of different datasets show different performances by introducing Push Locality and Pull Locality.

Code Availability The LaganLighter source-code will be published soon.

BibTex

@INPROCEEDINGS{10.1109/ISPASS51385.2021.00023,
author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
booktitle={2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
title={How Do Graph Relabeling Algorithms Improve Memory Locality?},
year={2021},
volume={},
number={},
pages={84-86},
publisher={IEEE Computer Society},
doi={10.1109/ISPASS51385.2021.00023}
}

This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree distribution between partitions. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09× over Ligra, 1.41× over Polymer and 1.65× over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs. VEBO improves GraphGrind performance with a speedup of 2.9× over Ligra on average.

Graph partitioning drives graph processing in distributed, disk-based and NUMA-aware systems. A commonly used partitioning goal is to balance the number of edges per partition in conjunction with minimizing the edge or vertex cut. While this type of partitioning is computationally expensive, we observe that such topology-driven partitioning nonetheless results in computational load imbalance. We propose Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO optimally balances edges and vertices for graphs with a power-law degree distribution. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09x over Ligra, 1.41x over Polymer and 1.65x over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs.

This paper investigates how to improve the memory locality of graph-structured analytics on large-scale shared memory systems. We demonstrate that a graph partitioning where all in-edges for a vertex are placed in the same partition improves memory locality. However, realising performance improvement through such graph partitioning poses several challenges and requires rethinking the classification of graph algorithms and preferred data structures. We introduce the notion of medium dense frontiers, a type of frontier that is sufficiently dense for a bitmap representation, yet benefits from an indexed graph layout. Using three types of frontiers, and three graph layout schemes optimized to each frontier type, we design an edge traversal algorithm that autonomously decides which type to use. The distinction of forward vs. backward graph traversal folds into this decision and need no longer be specified by the programmer.We have implemented our techniques in a NUMA-aware graph analytics framework derived from Ligra and demonstrate a speedup of up to 4.34× over Ligra and up to 2.93× over Polymer.

We investigate how graph partitioning adversely affects the performance of graph analytics. We demonstrate that graph partitioning induces extra work during graph traversal and that graph partitions have markedly different connectivity than the original graph. By consequence, increasing the number of partitions reaches a tipping point after which overheads quickly dominate performance gains. Moreover, we show that the heuristic to balance CPU load between graph partitions by balancing the number of edges is inappropriate for a range of graph analyses. However, even when it is appropriate, it is sub-optimal due to the skewed degree distribution of social networks. Based on these observations, we propose GraphGrind, a new graph analytics system that addresses the limitations incurred by graph partitioning. We moreover propose a NUMA-aware extension to the Cilk programming language and obtain a scale-free yet NUMA-aware parallel programming environment which underpins NUMA-aware scheduling in GraphGrind. We demonstrate that Graph-Grind outperforms state-of-the-art graph analytics systems for shared memory including Ligra, Polymer and Galois.