Invited Talk – Enabling high-performance large-scale irregular computations by Maciej Besta

17 June 2021

Abstract:
Large graphs are behind many problems in today’s computing landscape. The
growing sizes of such graphs, reaching 70 trillion edges recently, require
unprecedented amounts of compute power, storage, and energy. In this talk, we
illustrate how to effectively process such extreme-scale graphs. We will first
discuss Slim Graph, the first approach for practical lossy graph compression,
that facilitates high-performance approximate graph processing, storage, and
analytics with strong theoretical guarantees on accuracy, for a broad set of
graph problems and algorithms. In the second part of the talk, we will focus on
a class of complex graph mining problems such as clique listing. Here, we first
show how to solve such problems on complex parallel architectures in a simple
and high-performance way. For this, we propose a novel set-centric paradigm,
where complex algorithms are broken down into simple set algebra building
blocks such as set intersection or union, which can be separately optimized,
executed, and scheduled. Moreover, after discussing how to effectively and
efficiently mine complex graph patterns, we will turn our attention to pattern
prediction. Specifically, we establish a general problem of motif prediction,
in which we generalize link prediction, one of central problems in graph
analytics, into predicting the arrival of arbitrary complex higher-order
structures called motifs. To solve motif prediction, we harness recent graph
neural network architectures.

Bio
Maciej is a researcher from Scalable Parallel Computing Lab at ETH Zurich. He works on large-scale irregular computations and high-performance networking. He received Best Paper awards and Best Student Paper awards at ACM/IEEE Supercomputing 2013, 2014, and 2019, at ACM HPDC 2015 and 2016, ACM Research Highlights 2018, and several further best paper nominations (ACM HPDC 2014, ACM FPGA 2019, and ACM/IEEE Supercomputing 2019). He won, among others, the competition for the Best Student of Poland (2012), the first Google Fellowship in Parallel Computing (2013), and the ACM/IEEE-CS George Michael HPC Fellowship (2015). More detailed information on a personal site: https://people.inf.ethz.ch/bestam/

Invited Talk: Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent by Karl Bäckström

3 June 2021

Abstract
The emergence of big data in recent years due to the vast societal digitalization and large-scale sensor deployment has entailed significant interest in machine learning methods to enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure Stochastic gradient descent (SGD), is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset to converge to a solution of sufficient quality. In order to cope with increasing data volumes, and to facilitate accelerated processing utilizing contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale due to less coordination, and subsequently waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due the presence of both stale and inconsistent views of the shared state. In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we aim at exploring the impact of synchronization mechanisms, in particular consistency-preserving ones, and the overall effect on the convergence properties. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods, together with established baselines, focusing on the prominent application of Deep Learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in converge time for Leashed-SGD and MindTheStep-AsyncSGD.


Bio:

Karl Bäckström is a Ph.D. student at the Distributed Computing and Systems group at Chalmers University of Technology in Sweden. Karl has an academic background in Mathematics, Computer Science, and Engineering physics, with an overarching interest in distributed and parallel computation, optimization, and machine learning. Karl’s research directions include adaptiveness, synchronization, and consistency in parallel algorithms for iterative optimization. At the 35th IEEE International Parallel and Distributed Processing Symposium, Karl with co-authors were awarded Best Paper Honorable Mention for the paper “Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence”. Karl lives in Gothenburg, a coastal city in western Sweden, together with his Swiss Shepherd Valdi, often enjoying their free time together in nature and wilderness, or at home playing the Piano.

Invited Talk – Efficient Parallel Graph Processing on GPU using Approximate Computing By Somesh Singh

Abstract
Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing unit (GPU) is used to execute graph algorithms. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. We attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output, in exchange for obtaining the result faster. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds a controlled replica of nodes to improve coalescing. The second reduces memory latency by facilitating the processing of subgraphs inside shared memory by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each technique offers notable performance benefits and provides a knob to control inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of large graphs with varied characteristics and five popular graph algorithms.


Bio
Somesh Singh is a Ph.D. candidate in the Dept. of CSE at the Indian Institute of Technology Madras, India. His research interests span the areas of high-performance computing, parallel computing, and graph analytics. His dissertation research focuses on designing techniques for improving the efficiency of parallel graph analytics on graphics processing unit (GPU) by trading-off computational accuracy. His PhD works have been accepted for publication at ICPP 2020, PPoPP 2019 (poster) and TMSCS 2018. He was a research intern at Intel Labs, India  and Microsoft Research, India in 2020. He was a Google Summer of Code participant with CERN-HSF in 2017 and 2018.

Invited Talk – Parallel Graph Learning and Computational Biology Through Sparse Matrices by Dr Aydın Buluç

Dr Aydın Buluç
29 April 2021

Solving systems of linear equations have traditionally driven the research in sparse matrix computation for decades. Direct and iterative solvers, together with finite element computations, still account for the primary use case for sparse matrix data structures and algorithms. These sparse “solvers” often serve as the workhorse of many algorithms in spectral graph theory and traditional machine learning. 

In this talk, I will be highlighting two of the emerging use cases of sparse matrices outside the domain of solvers: graph representation learning methods such as graph neural networks (GNNs) and graph kernels, and computational biology problems such as de novo genome assembly and protein family detection. A recurring theme in these novel use cases is the concept of a semiring on which the sparse matrix computations are carried out. By overloading scalar addition and multiplication operators of a semiring, we can attack a much richer set of computational problems using the same sparse data structures and algorithms. This approach has been formalized by the GraphBLAS effort. I will illustrate one example application from each problem domain, together with the most computationally demanding sparse matrix primitive required for its efficient execution. I will also cover novel parallel algorithms for these sparse matrix primitives and available software that implement them efficiently on various architectures.


Aydın Buluç is a Staff Scientist and Principal Investigator at the Lawrence Berkeley National Laboratory (LBNL) and an Adjunct Assistant Professor of EECS at UC Berkeley. His research interests include parallel computing, combinatorial scientific computing, high performance graph analysis and machine learning, sparse matrix computations, and computational biology. Previously, he was a Luis W. Alvarez postdoctoral fellow at LBNL and a visiting scientist at the Simons Institute for the Theory of Computing. He received his PhD in Computer Science from the University of California, Santa Barbara in 2010 and his BS in Computer Science and Engineering from Sabanci University, Turkey in 2005. Dr. Buluç is a recipient of the DOE Early Career Award in 2013 and the IEEE TCSC Award for Excellence for Early Career Researchers in 2015. He is also a founding associate editor of the ACM Transactions on Parallel Computing. As a graduate student, he spent a semester at the Mathematics Department of MIT, and a summer at the CSRI institute of Sandia National Labs, in New Mexico. He is a member of the SIAM and the ACM.

Scheduling Fine-Grain Loops in Graph Processing Workloads

Scheduling or distributing the computational workload over multiple threads is a critical and repeatedly performed activity in graph processing workloads. In a recent paper “Reducing the burden of parallel loop schedulers for many‐core processors” published in Concurrency & Computation: Practice & Experience, we investigated the overhead introduced by scheduling. This overhead follows from two effects: (i) threads require to communicate and arrive at the same point in the program at the same time; (ii) inter-thread communication incurs significant cache misses and coherence messages sent between processors. We have likened the work distribution to barrier synchronisation and observed that state-of-the-art parallel schedulers such as the Intel OpenMP runtime and Intel Cilkplus incur the cost of a full-barrier synchronisation at the start of a parallel loop and at the end of the loop. The below figure illustrates the synchronisation pattern:

A barrier synchronisation is a synchronisation mechanism that waits for all threads to arrive at the barrier, then signals each thread they may continue execution. If we look in more detail at a barrier, it consists of two phases: a join phase and a release phase:

However, this introduces redundant synchronisation. It suffices to place only a half-barrier synchronisation at the start of the loop, and the other half at the end of the loop. Schematically, this looks like this:

Based on this observation, we designed an optimised scheduling technique that works specifically well for fine-grain loops, which are typically counted loops with very short loop bodies.

Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The below figure shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality kernel (BC). This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and30 MB L3 cache per socket. The baseline uses the Intel Cilkplus scheduler, while hybrid demonstrates performance of a hybrid version of the Cilkplus scheduler which can execute a mixture of coarse-grain loops (scheduled using the normal Cilkplus policy) and fine-grain loops using our optimised scheduler.

As graph processing applications contain a mix of fine-grain and coarse-grain loop, overall speedups in these applications is below 5%.

More details can be found in the paper, published under Open Access: https://onlinelibrary.wiley.com/doi/10.1002/cpe.6241

Invited Talk: Metaprogramming in Jupyter Notebooks – Dr Jeremy Singer

Dr Jeremy Singer, University of Glasgow
25 February 2021

Abstract:
Love ’em or hate ’em, interactive computational notebooks are here to stay as a mainstream code development medium. In particular, the Jupyter system is widely used by the data science community. This presentation explores some use cases for programmatic introspection of a Jupyter notebook from within a notebook itself. We sketch a possible reflection API for Jupyter and describe how its implementation is complicated by the under-the-hood message flows of the Jupyter distributed system architecture.


Bio:
Jeremy Singer is a senior lecturer in the School of Computing Science at the University of Glasgow, where he has worked for the past 10 years. Jeremy’s research interests include programming language compilers and runtimes, memory management, manycore parallelism, and distributed systems. He currently co-leads the EPSRC-funded Capable VMs project. Jeremy is the author of the textbook “Operating System Foundations with Linux on the Raspberry Pi” and lead educator of the “Functional Programming in Haskell” massive open online course.

Invited Talk: Addressing Practical HPC Problems: Fault Tolerance, Performance Portability, Parallelism Compilation – Dr Giorgis Georgakoudis

Dr Giorgis Georgakoudis, Lawrence Livermore National Laboratory
18 February 2021

Abstract:

This talk will present an overview of research on different areas of open problems in HPC. On fault tolerance, Giorgis will present the Reinit solution for fault tolerance in large scale MPI applications. Reinit improves the recovery time of checkpointed MPI applications by avoiding MPI re-deployment on restart, extending instead the MPI runtime to repair itself at runtime. On performance portability, Giorgis will present the auto-tuning framework Apollo, which provides an API for tuning execution parameters of code regions using machine learning at runtime. Lastly, Giorgis will talk on understanding deficiencies of parallelism compilation and of the approach to move forward for improving compiler optimizations for parallel programs.

Bio:

Giorgis Georgakoudis is a Computer Scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research interests include fault tolerance, optimized compilation of parallel programs, and runtime performance characterization and tuning. He is currently involved in the Exascale Computing Project, designing and developing fault-tolerance abstractions for MPI, and in the Apollo project for dynamic tuning the performance of parallel programs. Also, since October 2020, Giorgis leads his own project on compiler optimizations for parallelism as the Principal Investigator, funded through the Lab Directed Research and Development (LDRD) Program of LLNL.

Giorgis obtained his Dipl. Eng. (2007), Master’s (2010), and PhD degrees (2017) from the Department of Electrical and Computer Eng. of University of Thessaly, Greece. From 2013 to 2018, he was also affiliated with Queen’s University Belfast, UK working as a researcher concurrently with his PhD studies. Since November 2018 Giorgis is working in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. He is also a member of ACM and IEEE societies, and frequently provides professional service as a reviewer in conferences and journals.

Invited Talk: Parallel Algorithms for Density-Based and Structural Clustering – Dr Julian Shun

Dr Julian Shun, Massachusetts Institute of Technology
14 January 2021

Abstract: This talk presents new parallel algorithms for density-based
spatial clustering (DBSCAN) on point sets and structural clustering
(SCAN) on graphs, two problems that have received significant
attention due to their applicability in a variety of data analysis
tasks.

Existing parallel algorithms for DBSCAN require much more work than
their sequential counterparts, making them inefficient for large
datasets.  We bridge the gap between theory and practice of parallel
DBSCAN by presenting new parallel algorithms for Euclidean exact
DBSCAN and approximate DBSCAN that match the work bounds of their
sequential counterparts, and are highly parallel (polylogarithmic
depth).  For graphs, we present the first parallel index-based SCAN
algorithm, based a recent sequential algorithm, which enables users to
efficiently explore many different parameter settings for cluster
generation. Our parallel algorithm has a better work bound than the
sequential algorithm, and achieves logarithmic depth. We also apply
locality-sensitive hashing to design a novel approximate SCAN
algorithm and prove guarantees for its clustering quality.  We present
optimized implementations of our algorithms which achieve good
parallel scalability and outperform existing parallel implementations
by up to several orders of magnitude.

Bio: Julian Shun is the Douglas T. Ross Career Development Assistant
Professor of Software Technology in EECS and CSAIL at MIT.  Prior to
coming to MIT, he was a Miller Research Fellow at UC Berkeley.  He
received his Ph.D. from Carnegie Mellon University and his B.A. from
UC Berkeley.  His research focuses on the theory and practice of
parallel algorithms and programming frameworks.  He has received the
NSF CAREER Award, DOE Early Career Award, ACM Doctoral Dissertation
Award, CMU School of Computer Science Doctoral Dissertation Award,
Facebook Graduate Fellowship, Google Faculty Research Award, and best 
paper awards at PLDI, SPAA, and DCC.

Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges

https://doi.org/10.1109/HPEC43674.2020.9286179

Mixed-precision computation has been proposed as a means to accelerate iterative algorithms as it can reduce the memory bandwidth and cache effectiveness. This paper aims for further memory traffic reduction via introducing new half-precision (16 bit) data formats customized for PageRank. We develop two formats. A first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. A second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision compared to standard IEEE 754 formats, but sufficient dynamic range for PageRank. The experimental results on various size graphs show that the proposed formats can achieve an accuracy of 1e-4., which is an improvement over the state of the art. Due to random memory access patterns in the algorithm, performance improvements over our highly tuned baseline are 1.5% at best.