Invited Talk – Parallel Graph Learning and Computational Biology Through Sparse Matrices by Dr Aydın Buluç

Dr Aydın Buluç
29 April 2021

Solving systems of linear equations have traditionally driven the research in sparse matrix computation for decades. Direct and iterative solvers, together with finite element computations, still account for the primary use case for sparse matrix data structures and algorithms. These sparse “solvers” often serve as the workhorse of many algorithms in spectral graph theory and traditional machine learning. 

In this talk, I will be highlighting two of the emerging use cases of sparse matrices outside the domain of solvers: graph representation learning methods such as graph neural networks (GNNs) and graph kernels, and computational biology problems such as de novo genome assembly and protein family detection. A recurring theme in these novel use cases is the concept of a semiring on which the sparse matrix computations are carried out. By overloading scalar addition and multiplication operators of a semiring, we can attack a much richer set of computational problems using the same sparse data structures and algorithms. This approach has been formalized by the GraphBLAS effort. I will illustrate one example application from each problem domain, together with the most computationally demanding sparse matrix primitive required for its efficient execution. I will also cover novel parallel algorithms for these sparse matrix primitives and available software that implement them efficiently on various architectures.

Aydın Buluç is a Staff Scientist and Principal Investigator at the Lawrence Berkeley National Laboratory (LBNL) and an Adjunct Assistant Professor of EECS at UC Berkeley. His research interests include parallel computing, combinatorial scientific computing, high performance graph analysis and machine learning, sparse matrix computations, and computational biology. Previously, he was a Luis W. Alvarez postdoctoral fellow at LBNL and a visiting scientist at the Simons Institute for the Theory of Computing. He received his PhD in Computer Science from the University of California, Santa Barbara in 2010 and his BS in Computer Science and Engineering from Sabanci University, Turkey in 2005. Dr. Buluç is a recipient of the DOE Early Career Award in 2013 and the IEEE TCSC Award for Excellence for Early Career Researchers in 2015. He is also a founding associate editor of the ACM Transactions on Parallel Computing. As a graduate student, he spent a semester at the Mathematics Department of MIT, and a summer at the CSRI institute of Sandia National Labs, in New Mexico. He is a member of the SIAM and the ACM.

Scheduling Fine-Grain Loops in Graph Processing Workloads

Scheduling or distributing the computational workload over multiple threads is a critical and repeatedly performed activity in graph processing workloads. In a recent paper “Reducing the burden of parallel loop schedulers for many‐core processors” published in Concurrency & Computation: Practice & Experience, we investigated the overhead introduced by scheduling. This overhead follows from two effects: (i) threads require to communicate and arrive at the same point in the program at the same time; (ii) inter-thread communication incurs significant cache misses and coherence messages sent between processors. We have likened the work distribution to barrier synchronisation and observed that state-of-the-art parallel schedulers such as the Intel OpenMP runtime and Intel Cilkplus incur the cost of a full-barrier synchronisation at the start of a parallel loop and at the end of the loop. The below figure illustrates the synchronisation pattern:

A barrier synchronisation is a synchronisation mechanism that waits for all threads to arrive at the barrier, then signals each thread they may continue execution. If we look in more detail at a barrier, it consists of two phases: a join phase and a release phase:

However, this introduces redundant synchronisation. It suffices to place only a half-barrier synchronisation at the start of the loop, and the other half at the end of the loop. Schematically, this looks like this:

Based on this observation, we designed an optimised scheduling technique that works specifically well for fine-grain loops, which are typically counted loops with very short loop bodies.

Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The below figure shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality kernel (BC). This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and30 MB L3 cache per socket. The baseline uses the Intel Cilkplus scheduler, while hybrid demonstrates performance of a hybrid version of the Cilkplus scheduler which can execute a mixture of coarse-grain loops (scheduled using the normal Cilkplus policy) and fine-grain loops using our optimised scheduler.

As graph processing applications contain a mix of fine-grain and coarse-grain loop, overall speedups in these applications is below 5%.

More details can be found in the paper, published under Open Access: