Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges

https://doi.org/10.1109/HPEC43674.2020.9286179

Mixed-precision computation has been proposed as a means to accelerate iterative algorithms as it can reduce the memory bandwidth and cache effectiveness. This paper aims for further memory traffic reduction via introducing new half-precision (16 bit) data formats customized for PageRank. We develop two formats. A first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. A second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision compared to standard IEEE 754 formats, but sufficient dynamic range for PageRank. The experimental results on various size graphs show that the proposed formats can achieve an accuracy of 1e-4., which is an improvement over the state of the art. Due to random memory access patterns in the algorithm, performance improvements over our highly tuned baseline are 1.5% at best.

Graptor: Efficient Pull and Push Style Vectorized Graph Processing

https://doi.org/10.1145/3392717.3392753

Vectorization seeks to accelerate computation through data-level parallelism. Vectorization has been applied to graph processing, where the graph is traversed either in a push style or a pull style. As it is not well understood which style will perform better, there is a need for both vectorized push and pull style traversals. This paper is the first to present a general solution to vectorizing push style traversal. It more-over presents an enhanced vectorized pull style traversal.

Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.

Experimental evaluation demonstrates average speedups of 2.72X over Ligra, 2.46X over GraphGrind, and 2.33X over GraphIt. Graptor outperforms Grazelle, which performs vectorized pull style graph processing, by 4.05 times.

    VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing (Poster)

    https://doi.org/10.1145/3293883.3295703

    This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree distribution between partitions. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09× over Ligra, 1.41× over Polymer and 1.65× over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs. VEBO improves GraphGrind performance with a speedup of 2.9× over Ligra on average.

    Graph processing lecture at MaRIONet Summer School

    https://manycore.org.uk/summerschool.html

    The Manycore Summer School gives researchers an opportunity to learn theory and practice in a range of emerging manycore technologies, from seven world-leading academic and industrial researchers. Participants engaged with cutting-edge material in lectures, hands-on labs, and interactive poster sessions. The Manycore Summer School was held from Monday 16th to Friday 20th July 2018 at the University of Glasgow.

    The slides are available here:

    VEBO: A Vertex- and Edge-Balanced Ordering Heuristic to Load Balance Parallel Graph Processing

    https://arxiv.org/abs/1806.06576

    Graph partitioning drives graph processing in distributed, disk-based and NUMA-aware systems. A commonly used partitioning goal is to balance the number of edges per partition in conjunction with minimizing the edge or vertex cut. While this type of partitioning is computationally expensive, we observe that such topology-driven partitioning nonetheless results in computational load imbalance. We propose Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO optimally balances edges and vertices for graphs with a power-law degree distribution. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09x over Ligra, 1.41x over Polymer and 1.65x over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs.

    Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning

    https://doi.org/10.1109/ICPP.2017.27

    This paper investigates how to improve the memory locality of graph-structured analytics on large-scale shared memory systems. We demonstrate that a graph partitioning where all in-edges for a vertex are placed in the same partition improves memory locality. However, realising performance improvement through such graph partitioning poses several challenges and requires rethinking the classification of graph algorithms and preferred data structures. We introduce the notion of medium dense frontiers, a type of frontier that is sufficiently dense for a bitmap representation, yet benefits from an indexed graph layout. Using three types of frontiers, and three graph layout schemes optimized to each frontier type, we design an edge traversal algorithm that autonomously decides which type to use. The distinction of forward vs. backward graph traversal folds into this decision and need no longer be specified by the programmer.We have implemented our techniques in a NUMA-aware graph analytics framework derived from Ligra and demonstrate a speedup of up to 4.34× over Ligra and up to 2.93× over Polymer.

    Talk at UKMAC 2017

    On July 11th, Hans gave a talk on “GraphGrind: Taming Irregular Memory Accesses in Graph Analytics Workloads” at the UK ManyCore workshop https://manycore.org.uk/ukmac2017.html

    The analysis of graph-structured data is gaining importance due to its relevance to social media and big data. Due to the interconnection patterns in social network graphs, the performance of graph analytics is impeded by irregular memory accesses patterns which expose memory latency. 
    This talk presents our recent work on high-performance graph analytics. We will demonstrate how graph partitioning is crucial to tame memory locality and how it can be used to map graph analytics to non-uniform memory architectures (NUMA). Key to the graph partitioning algorithm is that it achieves memory locality, avoids overlap in write-sets between threads and is efficient to apply. We will discuss the difficulties of making graph partitioning scalable as a result of a strongly biased degree distribution in the partitions. We will demonstrate solutions to these problems. We will moreover identify new opportunities to switch between different representations of the graph during graph traversal in order to maximise processing speed. 
    These ideas are implemented in GraphGrind, an open source framework for graph analytics on shared memory systems.