Graptor: Efficient Pull and Push Style Vectorized Graph Processing

https://doi.org/10.1145/3392717.3392753

Vectorization seeks to accelerate computation through data-level parallelism. Vectorization has been applied to graph processing, where the graph is traversed either in a push style or a pull style. As it is not well understood which style will perform better, there is a need for both vectorized push and pull style traversals. This paper is the first to present a general solution to vectorizing push style traversal. It more-over presents an enhanced vectorized pull style traversal.

Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.

Experimental evaluation demonstrates average speedups of 2.72X over Ligra, 2.46X over GraphGrind, and 2.33X over GraphIt. Graptor outperforms Grazelle, which performs vectorized pull style graph processing, by 4.05 times.

    VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing (Poster)

    https://doi.org/10.1145/3293883.3295703

    This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree distribution between partitions. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09× over Ligra, 1.41× over Polymer and 1.65× over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs. VEBO improves GraphGrind performance with a speedup of 2.9× over Ligra on average.

    Graph processing lecture at MaRIONet Summer School

    https://manycore.org.uk/summerschool.html

    The Manycore Summer School gives researchers an opportunity to learn theory and practice in a range of emerging manycore technologies, from seven world-leading academic and industrial researchers. Participants engaged with cutting-edge material in lectures, hands-on labs, and interactive poster sessions. The Manycore Summer School was held from Monday 16th to Friday 20th July 2018 at the University of Glasgow.

    The slides are available here:

    VEBO: A Vertex- and Edge-Balanced Ordering Heuristic to Load Balance Parallel Graph Processing

    https://arxiv.org/abs/1806.06576

    Graph partitioning drives graph processing in distributed, disk-based and NUMA-aware systems. A commonly used partitioning goal is to balance the number of edges per partition in conjunction with minimizing the edge or vertex cut. While this type of partitioning is computationally expensive, we observe that such topology-driven partitioning nonetheless results in computational load imbalance. We propose Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO optimally balances edges and vertices for graphs with a power-law degree distribution. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09x over Ligra, 1.41x over Polymer and 1.65x over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs.

    Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning

    https://doi.org/10.1109/ICPP.2017.27

    This paper investigates how to improve the memory locality of graph-structured analytics on large-scale shared memory systems. We demonstrate that a graph partitioning where all in-edges for a vertex are placed in the same partition improves memory locality. However, realising performance improvement through such graph partitioning poses several challenges and requires rethinking the classification of graph algorithms and preferred data structures. We introduce the notion of medium dense frontiers, a type of frontier that is sufficiently dense for a bitmap representation, yet benefits from an indexed graph layout. Using three types of frontiers, and three graph layout schemes optimized to each frontier type, we design an edge traversal algorithm that autonomously decides which type to use. The distinction of forward vs. backward graph traversal folds into this decision and need no longer be specified by the programmer.We have implemented our techniques in a NUMA-aware graph analytics framework derived from Ligra and demonstrate a speedup of up to 4.34× over Ligra and up to 2.93× over Polymer.

    Talk at UKMAC 2017

    On July 11th, Hans gave a talk on “GraphGrind: Taming Irregular Memory Accesses in Graph Analytics Workloads” at the UK ManyCore workshop https://manycore.org.uk/ukmac2017.html

    The analysis of graph-structured data is gaining importance due to its relevance to social media and big data. Due to the interconnection patterns in social network graphs, the performance of graph analytics is impeded by irregular memory accesses patterns which expose memory latency. 
    This talk presents our recent work on high-performance graph analytics. We will demonstrate how graph partitioning is crucial to tame memory locality and how it can be used to map graph analytics to non-uniform memory architectures (NUMA). Key to the graph partitioning algorithm is that it achieves memory locality, avoids overlap in write-sets between threads and is efficient to apply. We will discuss the difficulties of making graph partitioning scalable as a result of a strongly biased degree distribution in the partitions. We will demonstrate solutions to these problems. We will moreover identify new opportunities to switch between different representations of the graph during graph traversal in order to maximise processing speed. 
    These ideas are implemented in GraphGrind, an open source framework for graph analytics on shared memory systems.

    GraphGrind: Addressing Load Imbalance of Graph Partitioning

    https://doi.org/10.1145/3079079.3079097

    We investigate how graph partitioning adversely affects the performance of graph analytics. We demonstrate that graph partitioning induces extra work during graph traversal and that graph partitions have markedly different connectivity than the original graph. By consequence, increasing the number of partitions reaches a tipping point after which overheads quickly dominate performance gains. Moreover, we show that the heuristic to balance CPU load between graph partitions by balancing the number of edges is inappropriate for a range of graph analyses. However, even when it is appropriate, it is sub-optimal due to the skewed degree distribution of social networks. Based on these observations, we propose GraphGrind, a new graph analytics system that addresses the limitations incurred by graph partitioning. We moreover propose a NUMA-aware extension to the Cilk programming language and obtain a scale-free yet NUMA-aware parallel programming environment which underpins NUMA-aware scheduling in GraphGrind. We demonstrate that Graph-Grind outperforms state-of-the-art graph analytics systems for shared memory including Ligra, Polymer and Galois.