Scheduling Fine-Grain Loops in Graph Processing Workloads

Scheduling or distributing the computational workload over multiple threads is a critical and repeatedly performed activity in graph processing workloads. In a recent paper “Reducing the burden of parallel loop schedulers for many‐core processors” published in Concurrency & Computation: Practice & Experience, we investigated the overhead introduced by scheduling. This overhead follows from two effects: (i) threads require to communicate and arrive at the same point in the program at the same time; (ii) inter-thread communication incurs significant cache misses and coherence messages sent between processors. We have likened the work distribution to barrier synchronisation and observed that state-of-the-art parallel schedulers such as the Intel OpenMP runtime and Intel Cilkplus incur the cost of a full-barrier synchronisation at the start of a parallel loop and at the end of the loop. The below figure illustrates the synchronisation pattern:

A barrier synchronisation is a synchronisation mechanism that waits for all threads to arrive at the barrier, then signals each thread they may continue execution. If we look in more detail at a barrier, it consists of two phases: a join phase and a release phase:

However, this introduces redundant synchronisation. It suffices to place only a half-barrier synchronisation at the start of the loop, and the other half at the end of the loop. Schematically, this looks like this:

Based on this observation, we designed an optimised scheduling technique that works specifically well for fine-grain loops, which are typically counted loops with very short loop bodies.

Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The below figure shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality kernel (BC). This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and30 MB L3 cache per socket. The baseline uses the Intel Cilkplus scheduler, while hybrid demonstrates performance of a hybrid version of the Cilkplus scheduler which can execute a mixture of coarse-grain loops (scheduled using the normal Cilkplus policy) and fine-grain loops using our optimised scheduler.

As graph processing applications contain a mix of fine-grain and coarse-grain loop, overall speedups in these applications is below 5%.

More details can be found in the paper, published under Open Access: https://onlinelibrary.wiley.com/doi/10.1002/cpe.6241

VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing (Poster)

https://doi.org/10.1145/3293883.3295703

This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree distribution between partitions. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09× over Ligra, 1.41× over Polymer and 1.65× over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs. VEBO improves GraphGrind performance with a speedup of 2.9× over Ligra on average.

Graph processing lecture at MaRIONet Summer School

https://manycore.org.uk/summerschool.html

The Manycore Summer School gives researchers an opportunity to learn theory and practice in a range of emerging manycore technologies, from seven world-leading academic and industrial researchers. Participants engaged with cutting-edge material in lectures, hands-on labs, and interactive poster sessions. The Manycore Summer School was held from Monday 16th to Friday 20th July 2018 at the University of Glasgow.

The slides are available here:

VEBO: A Vertex- and Edge-Balanced Ordering Heuristic to Load Balance Parallel Graph Processing

https://arxiv.org/abs/1806.06576

Graph partitioning drives graph processing in distributed, disk-based and NUMA-aware systems. A commonly used partitioning goal is to balance the number of edges per partition in conjunction with minimizing the edge or vertex cut. While this type of partitioning is computationally expensive, we observe that such topology-driven partitioning nonetheless results in computational load imbalance. We propose Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO optimally balances edges and vertices for graphs with a power-law degree distribution. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09x over Ligra, 1.41x over Polymer and 1.65x over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs.

Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning

https://doi.org/10.1109/ICPP.2017.27

This paper investigates how to improve the memory locality of graph-structured analytics on large-scale shared memory systems. We demonstrate that a graph partitioning where all in-edges for a vertex are placed in the same partition improves memory locality. However, realising performance improvement through such graph partitioning poses several challenges and requires rethinking the classification of graph algorithms and preferred data structures. We introduce the notion of medium dense frontiers, a type of frontier that is sufficiently dense for a bitmap representation, yet benefits from an indexed graph layout. Using three types of frontiers, and three graph layout schemes optimized to each frontier type, we design an edge traversal algorithm that autonomously decides which type to use. The distinction of forward vs. backward graph traversal folds into this decision and need no longer be specified by the programmer.We have implemented our techniques in a NUMA-aware graph analytics framework derived from Ligra and demonstrate a speedup of up to 4.34× over Ligra and up to 2.93× over Polymer.

Talk at UKMAC 2017

On July 11th, Hans gave a talk on “GraphGrind: Taming Irregular Memory Accesses in Graph Analytics Workloads” at the UK ManyCore workshop https://manycore.org.uk/ukmac2017.html

The analysis of graph-structured data is gaining importance due to its relevance to social media and big data. Due to the interconnection patterns in social network graphs, the performance of graph analytics is impeded by irregular memory accesses patterns which expose memory latency. 
This talk presents our recent work on high-performance graph analytics. We will demonstrate how graph partitioning is crucial to tame memory locality and how it can be used to map graph analytics to non-uniform memory architectures (NUMA). Key to the graph partitioning algorithm is that it achieves memory locality, avoids overlap in write-sets between threads and is efficient to apply. We will discuss the difficulties of making graph partitioning scalable as a result of a strongly biased degree distribution in the partitions. We will demonstrate solutions to these problems. We will moreover identify new opportunities to switch between different representations of the graph during graph traversal in order to maximise processing speed. 
These ideas are implemented in GraphGrind, an open source framework for graph analytics on shared memory systems.

GraphGrind: Addressing Load Imbalance of Graph Partitioning

https://doi.org/10.1145/3079079.3079097

We investigate how graph partitioning adversely affects the performance of graph analytics. We demonstrate that graph partitioning induces extra work during graph traversal and that graph partitions have markedly different connectivity than the original graph. By consequence, increasing the number of partitions reaches a tipping point after which overheads quickly dominate performance gains. Moreover, we show that the heuristic to balance CPU load between graph partitions by balancing the number of edges is inappropriate for a range of graph analyses. However, even when it is appropriate, it is sub-optimal due to the skewed degree distribution of social networks. Based on these observations, we propose GraphGrind, a new graph analytics system that addresses the limitations incurred by graph partitioning. We moreover propose a NUMA-aware extension to the Cilk programming language and obtain a scale-free yet NUMA-aware parallel programming environment which underpins NUMA-aware scheduling in GraphGrind. We demonstrate that Graph-Grind outperforms state-of-the-art graph analytics systems for shared memory including Ligra, Polymer and Galois.