Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’21

50th International Conference on Parallel Processing (ICPP’21)
August 9-12, 2021


DOI:10.1145/3472456.3472462

Acceptance Rate: 26.4%

This paper investigates how the structure of real-world graphs with power-law degree distributions affects the locality of SpMV-based graph analytics. By evaluating locality-optimizing graph reordering algorithms (such as SlashBurn, GOrder, and Rabbit-Order), it shows that irregular datasets require special traversals to improve locality for hub vertices, which account for a large portion of the processing time.

We introduce in-Hub Temporal Locality (iHTL) as a structure-aware and cache-friendly graph traversal that optimizes locality in pull traversal. iHTL identifies different blocks in the adjacency matrix of a graph and applies a suitable traversal direction (push or pull) for each block based on its contents. In other words, iHTL optimizes locality of one traversal of all edges of the graph by:

(1) applying push direction for flipped blocks containing edges to in-hubs, and
(2) applying pull direction for processing the sparse block containing edges to non-hubs.
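
The two-direction idea in the list above can be illustrated with a minimal Python sketch. It is not the iHTL implementation: the `hybrid_spmv` function, the edge-list input, and the simple in-degree threshold used to pick in-hubs are all assumptions made for illustration, whereas iHTL identifies blocks in the adjacency matrix.

```python
from collections import defaultdict


def hybrid_spmv(edges, x, num_vertices, hub_threshold):
    """One SpMV-style pass computing y[dst] = sum of x[src] over edges (src, dst).

    Edges whose destination is an in-hub are processed in push direction,
    all other edges in pull direction. The in-degree threshold is a
    hypothetical stand-in for a real hub-selection step.
    """
    in_degree = [0] * num_vertices
    for _, dst in edges:
        in_degree[dst] += 1
    is_hub = [d >= hub_threshold for d in in_degree]

    edges_to_hubs = defaultdict(list)      # src -> in-hub destinations (push part)
    edges_to_non_hubs = defaultdict(list)  # non-hub dst -> sources (pull part)
    for src, dst in edges:
        if is_hub[dst]:
            edges_to_hubs[src].append(dst)
        else:
            edges_to_non_hubs[dst].append(src)

    y = [0.0] * num_vertices

    # Push: each source scatters its value into the small set of hub entries,
    # which is the part of y that is meant to stay cache-resident.
    for src, hub_dsts in edges_to_hubs.items():
        for dst in hub_dsts:
            y[dst] += x[src]

    # Pull: each non-hub destination gathers the values of its in-neighbours.
    for dst, srcs in edges_to_non_hubs.items():
        y[dst] = sum(x[src] for src in srcs)

    return y
```

In this sketch every edge is visited exactly once, so the result matches a plain pull traversal; only the order of memory accesses differs.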

Moreover, iHTL introduces a new algorithm that efficiently determines the number of flipped blocks by investigating the connections between the hub vertices of the graph. This allows iHTL to create as many flipped blocks as the graph structure requires and makes iHTL suitable for a wide range of real-world graph datasets, such as social networks and web graphs.

iHTL is 1.5× – 2.4× faster than pull and 4.8× – 9.5× faster than push in state-of-the-art graph processing frameworks. More importantly, iHTL is 1.3× – 1.5× faster than pull traversal of graphs reordered by state-of-the-art locality-optimizing reordering algorithms such as SlashBurn, GOrder, and Rabbit-Order, while reducing the preprocessing time by 780× on average.

Code Availability
The source code will be published soon.

Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’21

IEEE CLUSTER 2021
7-10 September

Acceptance rate: 29.4%

This page is being updated.


Invited Talk – Enabling high-performance large-scale irregular computations by Maciej Besta

17 June 2021

Abstract:
Large graphs are behind many problems in today’s computing landscape. The
growing sizes of such graphs, reaching 70 trillion edges recently, require
unprecedented amounts of compute power, storage, and energy. In this talk, we
illustrate how to effectively process such extreme-scale graphs. We will first
discuss Slim Graph, the first approach for practical lossy graph compression,
that facilitates high-performance approximate graph processing, storage, and
analytics with strong theoretical guarantees on accuracy, for a broad set of
graph problems and algorithms. In the second part of the talk, we will focus on
a class of complex graph mining problems such as clique listing. Here, we first
show how to solve such problems on complex parallel architectures in a simple
and high-performance way. For this, we propose a novel set-centric paradigm,
where complex algorithms are broken down into simple set algebra building
blocks such as set intersection or union, which can be separately optimized,
executed, and scheduled. Moreover, after discussing how to effectively and
efficiently mine complex graph patterns, we will turn our attention to pattern
prediction. Specifically, we establish a general problem of motif prediction,
in which we generalize link prediction, one of the central problems in graph
analytics, into predicting the arrival of arbitrary complex higher-order
structures called motifs. To solve motif prediction, we harness recent graph
neural network architectures.

Bio
Maciej is a researcher from the Scalable Parallel Computing Lab at ETH Zurich. He works on large-scale irregular computations and high-performance networking. He received Best Paper awards and Best Student Paper awards at ACM/IEEE Supercomputing 2013, 2014, and 2019, at ACM HPDC 2015 and 2016, ACM Research Highlights 2018, and several further best paper nominations (ACM HPDC 2014, ACM FPGA 2019, and ACM/IEEE Supercomputing 2019). He won, among others, the competition for the Best Student of Poland (2012), the first Google Fellowship in Parallel Computing (2013), and the ACM/IEEE-CS George Michael HPC Fellowship (2015). More detailed information is available on his personal site: https://people.inf.ethz.ch/bestam/

Invited Talk: Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent by Karl Bäckström

3 June 2021

Abstract
The emergence of big data in recent years due to the vast societal digitalization and large-scale sensor deployment has entailed significant interest in machine learning methods to enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure stochastic gradient descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset to converge to a solution of sufficient quality. In order to cope with increasing data volumes, and to facilitate accelerated processing utilizing contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale, which follows from less coordination and, subsequently, less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state. In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we aim at exploring the impact of synchronization mechanisms, in particular consistency-preserving ones, and the overall effect on the convergence properties.
To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods, together with established baselines, focusing on the prominent application of Deep Learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD.


Bio:

Karl Bäckström is a Ph.D. student at the Distributed Computing and Systems group at Chalmers University of Technology in Sweden. Karl has an academic background in Mathematics, Computer Science, and Engineering Physics, with an overarching interest in distributed and parallel computation, optimization, and machine learning. Karl’s research directions include adaptiveness, synchronization, and consistency in parallel algorithms for iterative optimization. At the 35th IEEE International Parallel and Distributed Processing Symposium, Karl and his co-authors were awarded Best Paper Honorable Mention for the paper “Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence”. Karl lives in Gothenburg, a coastal city in western Sweden, together with his Swiss Shepherd Valdi, often enjoying their free time together in nature and wilderness, or at home playing the piano.

Invited Talk – Efficient Parallel Graph Processing on GPU using Approximate Computing by Somesh Singh

Abstract
Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when a graphics processing unit (GPU) is used to execute graph algorithms. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. We attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output, in exchange for obtaining the result faster. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replicas of nodes to improve coalescing. The second reduces memory latency by facilitating the processing of subgraphs inside shared memory by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each technique offers notable performance benefits and provides a knob to control the inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of large graphs with varied characteristics and five popular graph algorithms.


Bio
Somesh Singh is a Ph.D. candidate in the Dept. of CSE at the Indian Institute of Technology Madras, India. His research interests span the areas of high-performance computing, parallel computing, and graph analytics. His dissertation research focuses on designing techniques for improving the efficiency of parallel graph analytics on graphics processing units (GPUs) by trading off computational accuracy. His PhD work has been accepted for publication at ICPP 2020, PPoPP 2019 (poster), and TMSCS 2018. He was a research intern at Intel Labs, India and Microsoft Research, India in 2020. He was a Google Summer of Code participant with CERN-HSF in 2017 and 2018.

Invited Talk – Parallel Graph Learning and Computational Biology Through Sparse Matrices by Dr Aydın Buluç

Dr Aydın Buluç
29 April 2021

Solving systems of linear equations has driven the research in sparse matrix computation for decades. Direct and iterative solvers, together with finite element computations, still account for the primary use case for sparse matrix data structures and algorithms. These sparse “solvers” often serve as the workhorse of many algorithms in spectral graph theory and traditional machine learning.

In this talk, I will highlight two of the emerging use cases of sparse matrices outside the domain of solvers: graph representation learning methods such as graph neural networks (GNNs) and graph kernels, and computational biology problems such as de novo genome assembly and protein family detection. A recurring theme in these novel use cases is the concept of a semiring on which the sparse matrix computations are carried out. By overloading the scalar addition and multiplication operators of a semiring, we can attack a much richer set of computational problems using the same sparse data structures and algorithms. This approach has been formalized by the GraphBLAS effort. I will illustrate one example application from each problem domain, together with the most computationally demanding sparse matrix primitive required for its efficient execution. I will also cover novel parallel algorithms for these sparse matrix primitives and available software that implements them efficiently on various architectures.
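
As a small illustration of the semiring idea in the abstract above, the following Python sketch runs the same sparse matrix-vector product under two semirings. It is not the GraphBLAS API: the `Semiring` class, the `spmv` function, and the row-wise list-of-nonzeros layout are hypothetical names and choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]  # plays the role of scalar "+"
    mul: Callable[[Any, Any], Any]  # plays the role of scalar "*"
    zero: Any                       # identity element of add


# Conventional arithmetic, as used by numerical solvers.
PLUS_TIMES = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0)
# (min, +): one product relaxes every edge once, as in shortest-path algorithms.
MIN_PLUS = Semiring(add=min, mul=lambda a, b: a + b, zero=math.inf)


def spmv(rows, x, semiring):
    """Sparse matrix-vector product; rows[i] lists the (column, value) nonzeros of row i."""
    y = []
    for row in rows:
        acc = semiring.zero
        for col, val in row:
            acc = semiring.add(acc, semiring.mul(val, x[col]))
        y.append(acc)
    return y


# Path graph 0 -> 1 -> 2 with edge weights 2 and 3. Entry (i, j) holds the
# weight of edge j -> i, and the diagonal holds 0 so a vertex keeps its distance.
rows = [[(0, 0.0)], [(0, 2.0), (1, 0.0)], [(1, 3.0), (2, 0.0)]]
dist = [0.0, math.inf, math.inf]
dist = spmv(rows, dist, MIN_PLUS)  # distances using at most one edge
dist = spmv(rows, dist, MIN_PLUS)  # distances using at most two edges
```

The data structure and the loop are identical in both cases; only the two scalar operators and the additive identity change.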


Aydın Buluç is a Staff Scientist and Principal Investigator at the Lawrence Berkeley National Laboratory (LBNL) and an Adjunct Assistant Professor of EECS at UC Berkeley. His research interests include parallel computing, combinatorial scientific computing, high performance graph analysis and machine learning, sparse matrix computations, and computational biology. Previously, he was a Luis W. Alvarez postdoctoral fellow at LBNL and a visiting scientist at the Simons Institute for the Theory of Computing. He received his PhD in Computer Science from the University of California, Santa Barbara in 2010 and his BS in Computer Science and Engineering from Sabanci University, Turkey in 2005. Dr. Buluç received the DOE Early Career Award in 2013 and the IEEE TCSC Award for Excellence for Early Career Researchers in 2015. He is also a founding associate editor of the ACM Transactions on Parallel Computing. As a graduate student, he spent a semester at the Mathematics Department of MIT, and a summer at the CSRI institute of Sandia National Labs, in New Mexico. He is a member of SIAM and the ACM.

Scheduling Fine-Grain Loops in Graph Processing Workloads

Scheduling, or distributing the computational workload over multiple threads, is a critical and repeatedly performed activity in graph processing workloads. In a recent paper, “Reducing the burden of parallel loop schedulers for many‐core processors”, published in Concurrency & Computation: Practice & Experience, we investigated the overhead introduced by scheduling. This overhead follows from two effects: (i) threads need to communicate and arrive at the same point in the program at the same time; (ii) inter-thread communication incurs significant cache misses and coherence messages sent between processors. We have likened the work distribution to barrier synchronisation and observed that state-of-the-art parallel schedulers such as the Intel OpenMP runtime and Intel Cilkplus incur the cost of a full-barrier synchronisation at the start of a parallel loop and at the end of the loop. The figure below illustrates the synchronisation pattern:

A barrier synchronisation is a synchronisation mechanism that waits for all threads to arrive at the barrier, then signals to each thread that it may continue execution. If we look in more detail at a barrier, it consists of two phases: a join phase and a release phase:

However, using a full barrier at both ends of the loop introduces redundant synchronisation. It suffices to place only a half-barrier synchronisation at the start of the loop, and the other half at the end of the loop. Schematically, this looks as follows:

Based on this observation, we designed an optimised scheduling technique that works specifically well for fine-grain loops, which are typically counted loops with very short loop bodies.
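
The half-barrier idea can be sketched in a few lines of Python. This is an illustration of the synchronisation pattern only, not the scheduler from the paper: the function name `run_fine_grain_loop` and the static interleaved split of iterations are hypothetical, and the sketch assumes that the release half sits at the start of the loop and the join half at the end.

```python
import threading


def run_fine_grain_loop(num_workers, num_iterations, body):
    """Run body(i) for all i using one half-barrier at each end of the loop."""
    loop_released = threading.Event()       # release half: main thread -> workers
    workers_done = threading.Semaphore(0)   # join half: workers -> main thread

    def worker(worker_id):
        loop_released.wait()                # wait only for the release signal
        for i in range(worker_id, num_iterations, num_workers):
            body(i)
        workers_done.release()              # announce arrival; do not wait for peers

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for thread in threads:
        thread.start()

    loop_released.set()                     # start of loop: release phase only
    for _ in range(num_workers):
        workers_done.acquire()              # end of loop: join phase only
    for thread in threads:
        thread.join()                       # clean up the helper threads


squares = [0] * 8
run_fine_grain_loop(4, len(squares), lambda i: squares.__setitem__(i, i * i))
```

Workers never wait for each other in this pattern: they wait once for the release signal, and the main thread waits once for all of them to arrive.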

Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The figure below shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality kernel (BC). This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and 30 MB L3 cache per socket. The baseline uses the Intel Cilkplus scheduler, while hybrid demonstrates the performance of a hybrid version of the Cilkplus scheduler, which can execute a mixture of coarse-grain loops (scheduled using the normal Cilkplus policy) and fine-grain loops (scheduled using our optimised scheduler).

As graph processing applications contain a mix of fine-grain and coarse-grain loops, the overall speedup in these applications is below 5%.
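
The gap between the per-loop and the overall numbers follows from Amdahl-style arithmetic, sketched below. The 15% share of run time spent in fine-grain loops is a hypothetical value chosen for illustration and is not a measurement from the paper; the sketch also assumes that "sped up by 21.6%" means the loops run 1.216× faster.

```python
def overall_speedup(fine_grain_fraction, fine_grain_speedup):
    """Amdahl-style estimate of whole-application speedup.

    fine_grain_fraction: share of the original run time spent in fine-grain loops.
    fine_grain_speedup: factor by which those loops alone become faster.
    """
    if not 0.0 <= fine_grain_fraction <= 1.0:
        raise ValueError("fine_grain_fraction must lie in [0, 1]")
    if fine_grain_speedup <= 0.0:
        raise ValueError("fine_grain_speedup must be positive")
    new_time = (1.0 - fine_grain_fraction) + fine_grain_fraction / fine_grain_speedup
    return 1.0 / new_time


# Hypothetical 15% share of run time, with the two ends of the reported range.
low = overall_speedup(0.15, 1.216)
high = overall_speedup(0.15, 1.296)
```

Under these assumptions both estimates stay below 1.05, that is, below a 5% overall improvement, because the coarse-grain loops dominate the run time.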

More details can be found in the paper, published under Open Access: https://onlinelibrary.wiley.com/doi/10.1002/cpe.6241

How Do Graph Relabeling Algorithms Improve Memory Locality? ISPASS’21

IEEE Xplore (DOI: 10.1109/ISPASS51385.2021.00023)
2021 IEEE International Symposium on Performance Analysis of Systems and Software
March 28-30, 2021

Authors’ Copy in PDF Format

Relabeling (reordering) algorithms aim to improve the poor memory locality of graph processing by changing the order of vertices. This paper analyses the functionality of three state-of-the-art relabeling algorithms: SlashBurn, GOrder, and Rabbit-Order for real-world graphs.

We use a number of techniques to explain how locality is affected by relabeling algorithms and how locality of different datasets (like social networks and web graphs) is enhanced by relabeling algorithms.

We use last-level cache simulation to study the miss-rate degree distribution. We also use the degree distribution of the Giant Connected Component (GCC) in SlashBurn iterations to see if real-world graphs follow the assumption that “power-law graphs are created/destroyed recursively” [SlashBurn]. We present SlashBurn++ as an enhanced version of SlashBurn with lower preprocessing time and better locality.

Using cache simulation, we count the number of misses incurred when accessing the vertex data of high-degree vertices. This helps to explain how GOrder provides better temporal locality by managing cache space. Average ID Distance (AID) is a spatial locality metric introduced in this paper to explain how clustering relabeling algorithms like Rabbit-Order provide better spatial locality.

This paper also investigates why push and pull traversals perform differently on different datasets by introducing Push Locality and Pull Locality.

Code Availability
The LaganLighter source code will be published soon.

BibTex

@INPROCEEDINGS{9408196,
  author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle={2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, 
  title={How Do Graph Relabeling Algorithms Improve Memory Locality?}, 
  year={2021},
  volume={},
  number={},
  pages={84-86},
  publisher={IEEE Computer Society},
  doi={10.1109/ISPASS51385.2021.00023}
}

Invited Talk: Metaprogramming in Jupyter Notebooks – Dr Jeremy Singer

Dr Jeremy Singer, University of Glasgow
25 February 2021

Abstract:
Love ’em or hate ’em, interactive computational notebooks are here to stay as a mainstream code development medium. In particular, the Jupyter system is widely used by the data science community. This presentation explores some use cases for programmatic introspection of a Jupyter notebook from within a notebook itself. We sketch a possible reflection API for Jupyter and describe how its implementation is complicated by the under-the-hood message flows of the Jupyter distributed system architecture.


Bio:
Jeremy Singer is a senior lecturer in the School of Computing Science at the University of Glasgow, where he has worked for the past 10 years. Jeremy’s research interests include programming language compilers and runtimes, memory management, manycore parallelism, and distributed systems. He currently co-leads the EPSRC-funded Capable VMs project. Jeremy is the author of the textbook “Operating System Foundations with Linux on the Raspberry Pi” and lead educator of the “Functional Programming in Haskell” massive open online course.

Invited Talk: Addressing Practical HPC Problems: Fault Tolerance, Performance Portability, Parallelism Compilation – Dr Giorgis Georgakoudis

Dr Giorgis Georgakoudis, Lawrence Livermore National Laboratory
18 February 2021

Abstract:

This talk will present an overview of research on different areas of open problems in HPC. On fault tolerance, Giorgis will present the Reinit solution for fault tolerance in large-scale MPI applications. Reinit improves the recovery time of checkpointed MPI applications by avoiding MPI re-deployment on restart, extending instead the MPI runtime to repair itself at runtime. On performance portability, Giorgis will present the auto-tuning framework Apollo, which provides an API for tuning execution parameters of code regions using machine learning at runtime. Lastly, Giorgis will discuss the deficiencies of parallelism compilation and an approach for improving compiler optimizations for parallel programs.

Bio:

Giorgis Georgakoudis is a Computer Scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research interests include fault tolerance, optimized compilation of parallel programs, and runtime performance characterization and tuning. He is currently involved in the Exascale Computing Project, designing and developing fault-tolerance abstractions for MPI, and in the Apollo project for dynamically tuning the performance of parallel programs. Also, since October 2020, Giorgis has led his own project on compiler optimizations for parallelism as the Principal Investigator, funded through the Lab Directed Research and Development (LDRD) Program of LLNL.

Giorgis obtained his Dipl. Eng. (2007), Master’s (2010), and PhD (2017) degrees from the Department of Electrical and Computer Eng. of the University of Thessaly, Greece. From 2013 to 2018, he was also affiliated with Queen’s University Belfast, UK, working as a researcher concurrently with his PhD studies. Since November 2018, Giorgis has been working in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. He is also a member of the ACM and IEEE societies, and frequently provides professional service as a reviewer for conferences and journals.