Scheduling Fine-Grain Loops in Graph Processing Workloads

Scheduling, or distributing the computational workload over multiple threads, is a critical and frequently performed activity in graph processing workloads. In a recent paper, “Reducing the burden of parallel loop schedulers for many-core processors”, published in Concurrency & Computation: Practice & Experience, we investigated the overhead introduced by scheduling. This overhead stems from two effects: (i) threads need to communicate and arrive at the same point in the program at the same time; (ii) inter-thread communication incurs significant cache misses and coherence messages between processors. We have likened work distribution to barrier synchronisation and observed that state-of-the-art parallel schedulers such as the Intel OpenMP runtime and Intel Cilk Plus incur the cost of a full-barrier synchronisation both at the start and at the end of a parallel loop. The figure below illustrates the synchronisation pattern:

A barrier synchronisation is a synchronisation mechanism that waits for all threads to arrive at the barrier, then signals each thread that it may continue execution. Looking at a barrier in more detail, it consists of two phases: a join phase and a release phase:

However, this introduces redundant synchronisation. It suffices to place only a half-barrier synchronisation at the start of the loop, and the other half at the end of the loop. Schematically, it looks like this:

Based on this observation, we designed an optimised scheduling technique that works particularly well for fine-grain loops, which are typically counted loops with very short loop bodies.
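To make this concrete, the sketch below shows one way the two halves can be decoupled in C++: the master thread publishes a loop using only the release half, and workers announce completion using only the join half, so no thread performs a full barrier at either end of the loop. This is a minimal illustration of the principle, not the scheduler from the paper; all names are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Hypothetical split of a barrier into its two halves. The start of a
// fine-grain loop needs only the master->worker "release" half, and the
// end needs only the worker->master "join" half, instead of a full
// barrier at both ends.
struct HalfBarriers {
    std::atomic<int> work_epoch{0};  // bumped by the master to publish a loop
    std::atomic<int> done_count{0};  // incremented by workers as they finish

    // Master side: publish a new loop (release half), then later wait
    // until all workers have finished it (join half).
    void start_loop() {
        done_count.store(0);
        work_epoch.fetch_add(1);
    }
    void wait_done(int nworkers) {
        while (done_count.load() != nworkers)
            std::this_thread::yield();  // real code would back off politely
    }

    // Worker side: wait until the next loop is published, run the chunk,
    // then announce completion without waiting for any other worker.
    void wait_for_work(int seen_epoch) {
        while (work_epoch.load() == seen_epoch)
            std::this_thread::yield();
    }
    void finish_chunk() {
        done_count.fetch_add(1);
    }
};
```

A worker that finishes its chunk never waits for its peers at the end of the loop; only the master waits, and only at the point where it actually needs the loop's results.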

Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The figure below shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality (BC) kernel. This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and 30 MB of L3 cache per socket. The baseline uses the Intel Cilk Plus scheduler, while “hybrid” shows the performance of a hybrid version of the Cilk Plus scheduler that can execute a mixture of coarse-grain loops (scheduled using the normal Cilk Plus policy) and fine-grain loops (scheduled using our optimised scheduler).

As graph processing applications contain a mix of fine-grain and coarse-grain loops, the overall speedup in these applications is below 5%.

More details can be found in the paper, published under Open Access: https://onlinelibrary.wiley.com/doi/10.1002/cpe.6241

Invited Talk: Metaprogramming in Jupyter Notebooks – Dr Jeremy Singer

Dr Jeremy Singer, University of Glasgow
25 February 2021

Abstract:
Love ’em or hate ’em, interactive computational notebooks are here to stay as a mainstream code development medium. In particular, the Jupyter system is widely used by the data science community. This presentation explores some use cases for programmatic introspection of a Jupyter notebook from within a notebook itself. We sketch a possible reflection API for Jupyter and describe how its implementation is complicated by the under-the-hood message flows of the Jupyter distributed system architecture.


Bio:
Jeremy Singer is a senior lecturer in the School of Computing Science at the University of Glasgow, where he has worked for the past 10 years. Jeremy’s research interests include programming language compilers and runtimes, memory management, manycore parallelism, and distributed systems. He currently co-leads the EPSRC-funded Capable VMs project. Jeremy is the author of the textbook “Operating System Foundations with Linux on the Raspberry Pi” and lead educator of the “Functional Programming in Haskell” massive open online course.

Invited Talk: Addressing Practical HPC Problems: Fault Tolerance, Performance Portability, Parallelism Compilation – Dr Giorgis Georgakoudis

Dr Giorgis Georgakoudis, Lawrence Livermore National Laboratory
18 February 2021

Abstract:

This talk will present an overview of research on several open problems in HPC. On fault tolerance, Giorgis will present the Reinit solution for fault tolerance in large-scale MPI applications. Reinit improves the recovery time of checkpointed MPI applications by avoiding MPI re-deployment on restart, instead extending the MPI runtime to repair itself at runtime. On performance portability, Giorgis will present the auto-tuning framework Apollo, which provides an API for tuning execution parameters of code regions using machine learning at runtime. Lastly, Giorgis will discuss deficiencies in how compilers handle parallelism, and a way forward for improving compiler optimizations for parallel programs.

Bio:

Giorgis Georgakoudis is a Computer Scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research interests include fault tolerance, optimized compilation of parallel programs, and runtime performance characterization and tuning. He is currently involved in the Exascale Computing Project, designing and developing fault-tolerance abstractions for MPI, and in the Apollo project for dynamically tuning the performance of parallel programs. Since October 2020, Giorgis has also led, as Principal Investigator, his own project on compiler optimizations for parallelism, funded through the Lab Directed Research and Development (LDRD) Program of LLNL.

Giorgis obtained his Dipl. Eng. (2007), Master’s (2010), and PhD (2017) degrees from the Department of Electrical and Computer Engineering of the University of Thessaly, Greece. From 2013 to 2018, he was also affiliated with Queen’s University Belfast, UK, working as a researcher concurrently with his PhD studies. Since November 2018, Giorgis has been working in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. He is a member of the ACM and the IEEE, and frequently provides professional service as a reviewer for conferences and journals.

Invited Talk: Parallel Algorithms for Density-Based and Structural Clustering – Dr Julian Shun

Dr Julian Shun, Massachusetts Institute of Technology
14 January 2021

Abstract: This talk presents new parallel algorithms for density-based
spatial clustering (DBSCAN) on point sets and structural clustering
(SCAN) on graphs, two problems that have received significant
attention due to their applicability in a variety of data analysis
tasks.

Existing parallel algorithms for DBSCAN require much more work than
their sequential counterparts, making them inefficient for large
datasets.  We bridge the gap between theory and practice of parallel
DBSCAN by presenting new parallel algorithms for Euclidean exact
DBSCAN and approximate DBSCAN that match the work bounds of their
sequential counterparts, and are highly parallel (polylogarithmic
depth).  For graphs, we present the first parallel index-based SCAN
algorithm, based on a recent sequential algorithm, which enables users to
efficiently explore many different parameter settings for cluster
generation. Our parallel algorithm has a better work bound than the
sequential algorithm, and achieves logarithmic depth. We also apply
locality-sensitive hashing to design a novel approximate SCAN
algorithm and prove guarantees for its clustering quality.  We present
optimized implementations of our algorithms which achieve good
parallel scalability and outperform existing parallel implementations
by up to several orders of magnitude.

Bio: Julian Shun is the Douglas T. Ross Career Development Assistant
Professor of Software Technology in EECS and CSAIL at MIT.  Prior to
coming to MIT, he was a Miller Research Fellow at UC Berkeley.  He
received his Ph.D. from Carnegie Mellon University and his B.A. from
UC Berkeley.  His research focuses on the theory and practice of
parallel algorithms and programming frameworks.  He has received the
NSF CAREER Award, DOE Early Career Award, ACM Doctoral Dissertation
Award, CMU School of Computer Science Doctoral Dissertation Award,
Facebook Graduate Fellowship, Google Faculty Research Award, and best 
paper awards at PLDI, SPAA, and DCC.

Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges

https://doi.org/10.1109/HPEC43674.2020.9286179

Mixed-precision computation has been proposed as a means to accelerate iterative algorithms, as it can reduce memory bandwidth requirements and improve cache effectiveness. This paper aims for further memory traffic reduction by introducing new half-precision (16-bit) data formats customized for PageRank. We develop two formats. The first builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. The second builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision than standard IEEE 754 formats, but sufficient dynamic range for PageRank. Experimental results on graphs of various sizes show that the proposed formats can achieve an accuracy of 1e-4, which is an improvement over the state of the art. Due to the random memory access patterns of the algorithm, performance improvements over our highly tuned baseline are 1.5% at best.
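As a rough illustration of the second observation, the sketch below packs a positive single-precision value into a hypothetical 16-bit layout with a 6-bit exponent and a 10-bit mantissa. The layout, bias and names are assumptions made for this example; the paper defines the actual formats.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical 16-bit layout for positive PageRank values: no sign bit,
// 6 exponent bits (bias 31), 10 mantissa bits. Zero and subnormals are
// ignored for brevity; a real format would reserve encodings for them.
using pr16 = uint16_t;

pr16 encode_pr16(float x) {
    uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    int e = static_cast<int>((b >> 23) & 0xFF) - 127 + 31;  // rebias 8 -> 6 bits
    e = e < 0 ? 0 : (e > 63 ? 63 : e);                      // clamp the range
    uint32_t m = (b >> 13) & 0x3FF;                         // top 10 mantissa bits
    return static_cast<pr16>((e << 10) | m);
}

float decode_pr16(pr16 h) {
    uint32_t e = (h >> 10) & 0x3F;
    uint32_t m = h & 0x3FF;
    uint32_t b = ((e + 96u) << 23) | (m << 13);             // 96 = 127 - 31
    float x;
    std::memcpy(&x, &b, sizeof x);
    return x;
}

int main() {
    float r = 1.0f / (1u << 20);  // e.g. 1/N for a graph with N = 2^20 vertices
    std::printf("%g -> %g\n", r, decode_pr16(encode_pr16(r)));
}
```

Storing ranks in 16 bits halves the traffic on the rank array, although, as noted above, the random access patterns of PageRank limit the end-to-end gains.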

Graptor: Efficient Pull and Push Style Vectorized Graph Processing

https://doi.org/10.1145/3392717.3392753

Vectorization seeks to accelerate computation through data-level parallelism. It has been applied to graph processing, where the graph is traversed either in a push style or a pull style. As it is not well understood which style will perform better, there is a need for both vectorized push and pull style traversals. This paper is the first to present a general solution for vectorizing push style traversal. It moreover presents an enhanced vectorized pull style traversal.
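For illustration, the scalar loops below capture what distinguishes the two styles (the vectorized forms are the subject of the paper): push scatters a vertex’s contribution along its out-edges and therefore conflicts on destinations, while pull gathers over in-edges and performs one non-conflicting write per vertex. The CSR layout and function names are assumptions for this sketch.

```cpp
#include <cstdint>
#include <vector>

// Compressed sparse row graph: vertex v's neighbours are
// neighbours[index[v]] .. neighbours[index[v+1]-1].
struct CSR {
    std::vector<uint32_t> index;      // size n+1
    std::vector<uint32_t> neighbours; // size m
    uint32_t n = 0;                   // number of vertices
};

// Push style: iterate over sources and scatter along out-edges.
void push_step(const CSR& out, const std::vector<float>& contrib,
               std::vector<float>& next) {
    for (uint32_t v = 0; v < out.n; ++v)
        for (uint32_t i = out.index[v]; i < out.index[v + 1]; ++i)
            next[out.neighbours[i]] += contrib[v];  // races if parallelised naively
}

// Pull style: iterate over destinations and gather along in-edges.
void pull_step(const CSR& in, const std::vector<float>& contrib,
               std::vector<float>& next) {
    for (uint32_t v = 0; v < in.n; ++v) {
        float sum = 0.0f;
        for (uint32_t i = in.index[v]; i < in.index[v + 1]; ++i)
            sum += contrib[in.neighbours[i]];
        next[v] = sum;  // single, conflict-free write per vertex
    }
}
```

Vectorizing the pull loop is comparatively natural, as the inner loop is a gather-and-reduce, whereas vectorizing push must cope with several vector lanes writing to the same destination, which is why a general vectorized push solution is the harder problem.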

Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.

Experimental evaluation demonstrates average speedups of 2.72× over Ligra, 2.46× over GraphGrind, and 2.33× over GraphIt. Graptor outperforms Grazelle, which performs vectorized pull style graph processing, by 4.05×.

VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing (Poster)

https://doi.org/10.1145/3293883.3295703

This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree distribution between partitions. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09× over Ligra, 1.41× over Polymer and 1.65× over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs. VEBO improves GraphGrind performance with a speedup of 2.9× over Ligra on average.

Graph processing lecture at MaRIONet Summer School

https://manycore.org.uk/summerschool.html

The Manycore Summer School gives researchers an opportunity to learn theory and practice in a range of emerging manycore technologies, from seven world-leading academic and industrial researchers. Participants engaged with cutting-edge material in lectures, hands-on labs, and interactive poster sessions. The Manycore Summer School was held from Monday 16th to Friday 20th July 2018 at the University of Glasgow.

The slides are available here:

VEBO: A Vertex- and Edge-Balanced Ordering Heuristic to Load Balance Parallel Graph Processing

https://arxiv.org/abs/1806.06576

Graph partitioning drives graph processing in distributed, disk-based and NUMA-aware systems. A commonly used partitioning goal is to balance the number of edges per partition in conjunction with minimizing the edge or vertex cut. While this type of partitioning is computationally expensive, we observe that such topology-driven partitioning nonetheless results in computational load imbalance. We propose Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO optimally balances edges and vertices for graphs with a power-law degree distribution. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09× over Ligra, 1.41× over Polymer and 1.65× over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs.
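VEBO’s actual algorithm is described in the paper; purely as an illustration of the balancing objective, the hypothetical greedy sketch below places high-degree vertices first, always into the partition with the smallest edge load, breaking ties on vertex count, so that both edges and vertices stay balanced.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative greedy balancer (not VEBO itself): assign vertices in order
// of decreasing degree to the partition with the fewest edges so far,
// breaking ties on the number of vertices already placed there.
std::vector<int> greedy_balance(const std::vector<uint32_t>& degree, int P) {
    std::vector<uint32_t> order(degree.size());
    std::iota(order.begin(), order.end(), 0u);
    std::sort(order.begin(), order.end(),
              [&](uint32_t a, uint32_t b) { return degree[a] > degree[b]; });

    std::vector<uint64_t> edge_load(P, 0), vertex_load(P, 0);
    std::vector<int> part(degree.size());
    for (uint32_t v : order) {
        int best = 0;
        for (int p = 1; p < P; ++p)
            if (edge_load[p] < edge_load[best] ||
                (edge_load[p] == edge_load[best] &&
                 vertex_load[p] < vertex_load[best]))
                best = p;
        part[v] = best;
        edge_load[best] += degree[v];
        vertex_load[best] += 1;
    }
    return part;
}
```

On power-law graphs this kind of longest-first placement keeps edge counts close, while the many low-degree vertices placed last even out the vertex counts, which is the intuition behind balancing both quantities at once.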