Parallel Computing Training Session 2023

Date: 6th March 2023
Location: CSB Training Room 02.017

Parallel computing is a key technology supporting high-performance computing (HPC) and data analytics. The goal of this module is to provide an overview of parallel computing and introduce attendees to prevailing programming models. The expected outcome of this module is that participants will have an understanding of the different languages and approaches used in this domain and that they will be able to construct simple parallel programs using the discussed languages.

Theoretical lectures and practical training will be dispensed by Hans Vandierendonck, Romain Garnier, Syed Ibtisam Tauhidi, Amir Sabbagh Molahosseini, and Zohreh Moradinia. A general overview of High performance computing will be presented in the morning by Romain Garnier and Amir Sabbagh Molahosseini following by lab classes using practical examples with OpenMP. In the afternoon Syed Ibtisam Tauhidi and Zohreh Moradinia will present how to efficiently handle data analytics with Hadoop and Spark.

ASCCED: Asynchronous Scientific Continuous Computations Exploiting Disaggregation

UKRI EPSRC Grant

The design of efficient and scalable scientific simulation software is reaching a critical point whereby continued advances are increasingly harder, more labour-intensive, and thus more expensive to achieve. This challenge emanates from the constantly evolving design of large-scale high-performance computing systems. World-leading (pre-)exascale systems, as well as their successors, are characterised by multi-million-scale parallel computing activities and a highly heterogeneous mix of processor types such as high-end many-core processors, Graphics Processing Units (GPU), machine learning accelerators, and various accelerators for compression, encryption and in-network processing. To make efficient use of these systems, scientific simulation software must be decomposed in various independent components and make simultaneous use of the variety of heterogeneous compute units.

Developing efficient, scalable scientific simulation software for these systems becomes increasingly harder as the limits of parallelism available in the simulation codes is approached. Moreover, the limit of parallelism cannot be reached in practice due to heterogeneity, system imbalances and synchronisation overheads. Scientific simulation software often persists over several decades. The software is optimised and re-optimised repeatedly as the design and scale of the target hardware evolves at a much faster pace, as impactful changes in the hardware may occur every few years. One may thus find that the guiding principles that underpin such software are outdated.

The ASCCED project will fundamentally change the status quo in the design of scientific simulation software by simplifying the design to reduce software development and maintenance effort, to facilitate performance optimisation, and to make software more robust to future evolution of computing hardware. The key distinguishing factor of our approach is to structure scientific simulation software as a collection of loosely coupled parallel activities. We will explore the opportunities and challenges of applying techniques previously developed for Parallel Discrete Event Simulation (PDES) to orchestrate these loosely coupled parallel activities. This radically novel approach will enable runtime system software to extract unprecedented scales of parallelism and to minimise performance inefficiencies due to synchronisation. Additionally, based on a speculative execution mechanism, it will uncover parallelism that has not been feasible to extract before.

The computational model proposed by ASCCED will, if successful, initiate a new direction of research within programming models for high-performance computing that may dramatically impact not only the performance of scientific simulation software, but can also reduce the engineering effort required to produce efficient scientific simulation software. It will have a profound impact on the sciences that are highly dependent on leadership computing capabilities, such as climate modeling and cancer research.

ROMA: Run-Time Object Detection To Maximize Real-Time Accuracy

This is a follow-up work on TOD, which selects one of multiple deep neural networks (DNNs) to perform real-time video analytics (object detection) on low-end devices, e.g., on the camera itself. TOD uses the median object size to determine which one out of 4 YOLO DNNs will meet the real-time requirement best, with respect to object size and speed of the objects. TOD requires specific knowledge of the device to select appropriate thresholds on the median object sizes, and needs to be retuned for each computing device.

ROMA removes the limitation that TOD imposes by performing a more detailed analysis of the image content. In particular, ROMA separately estimates the impact of the selected DNN on object size and on object speed. Its formulation is sufficiently flexible to adapt to changes in the computational power of the device such that it does not need to be retrained when migrating across hardware. Moreover, this way, ROMA can adapt to runtime changes in computational power, which may arise from power management features on the device, or from other workloads which share the device. ROMA does however have hyper-parameters that are dependent on the YOLO DNNs.

ROMA will be presented at the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

You can read the post-print on arXiv.

TOD: Transprecise Object Detection

Real-time video analytics on the edge is challenging as the computationally constrained resources typically cannot analyse video streams at full fidelity and frame rate, which results in loss of accuracy. We propose a Transprecise Object Detector (TOD) which maximises the real-time object detection accuracy on an edge device by selecting an appropriate Deep Neural Network (DNN) on the fly with negligible computational overhead.

TOD makes two key contributions over the state of the art: (1) TOD leverages characteristics of the video stream such as object size and speed of movement to identify networks with high prediction accuracy for the current frames; (2) it selects the best-performing network based on projected accuracy and computational demand using an effective and low-overhead decision mechanism.

Experimental evaluation on a Jetson Nano demonstrates that TOD improves the average object detection precision by 34.7 % over the YOLOv4-tiny-288 model on average over the MOT17Det dataset. In the MOT17-05 test dataset, TOD utilises only 45.1 % of GPU resource and 62.7 % of the GPU board power without losing accuracy, compared to YOLOv4-416 model. We expect that TOD will maximise the application of edge devices to real-time object detection, since TOD maximises real-time object detection accuracy given edge devices according to dynamic input features without increasing inference latency in practice.

TOD was presented at the 5th International Conference on Fog and Edge Computing (ICFEC). Read the paper on arXiv.

Approximate Maximum Weighted Clique

This project aims to develop novel algorithms for the maximum weighted clique (MWC) problem, which appears in various data analysis pipelines in precision medicine. The MWC problem is NP-hard in nature, which makes it particularly challenging given the exponentially increasing amount of data it is applied to.

Although several attempts have been made to solve the maximum weighted clique problem in large graphs, there is still much opportunity for lowering the execution time necessary to find a satisfactory solution. In this project in particular we are investigating approximate algorithms for the MWC problem. We are working towards an algorithm that achieves a very high quality solution (i.e., finding a clique with weight very close to the MWC) in polynomial time.

IBM will provide industrially relevant context on knowledge extraction from graph-structured data. They have extensive experience in this area by building scalable software systems for the analysis of massive-scale graph data. They will moreover provide access to relevant datasets.

Project Members

Publications

  • QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique, EURO-PAR 2024, link.

Funding

This PhD project is funded by the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie Actions.

SoftNum: Software-Defined Number Formats

Computing devices implement computer arithmetic as basic functionality, and they implement the same, standardized number formats in order to support software portability. However, with Moore’s Law ending, we question whether it remains the best approach to achieve high performance and low energy consumption by applying the same standardized number formats for all applications. We explore how to make number formats, generally considered to be hard-wired functionality, software-defined. Software-defined number formats have the advantage of high performance, low energy consumption, and ensure sufficient while not excessive precision.

SoftNum is Amir’s individual Marie Sklodowska Curie Fellowship, sponsored by the European Commission.

Invited Talk – Enabling high-performance large-scale irregular computations by Maciej Besta

17 June 2021

Abstract:
Large graphs are behind many problems in today’s computing landscape. The
growing sizes of such graphs, reaching 70 trillion edges recently, require
unprecedented amounts of compute power, storage, and energy. In this talk, we
illustrate how to effectively process such extreme-scale graphs. We will first
discuss Slim Graph, the first approach for practical lossy graph compression,
that facilitates high-performance approximate graph processing, storage, and
analytics with strong theoretical guarantees on accuracy, for a broad set of
graph problems and algorithms. In the second part of the talk, we will focus on
a class of complex graph mining problems such as clique listing. Here, we first
show how to solve such problems on complex parallel architectures in a simple
and high-performance way. For this, we propose a novel set-centric paradigm,
where complex algorithms are broken down into simple set algebra building
blocks such as set intersection or union, which can be separately optimized,
executed, and scheduled. Moreover, after discussing how to effectively and
efficiently mine complex graph patterns, we will turn our attention to pattern
prediction. Specifically, we establish a general problem of motif prediction,
in which we generalize link prediction, one of central problems in graph
analytics, into predicting the arrival of arbitrary complex higher-order
structures called motifs. To solve motif prediction, we harness recent graph
neural network architectures.

Bio
Maciej is a researcher from Scalable Parallel Computing Lab at ETH Zurich. He works on large-scale irregular computations and high-performance networking. He received Best Paper awards and Best Student Paper awards at ACM/IEEE Supercomputing 2013, 2014, and 2019, at ACM HPDC 2015 and 2016, ACM Research Highlights 2018, and several further best paper nominations (ACM HPDC 2014, ACM FPGA 2019, and ACM/IEEE Supercomputing 2019). He won, among others, the competition for the Best Student of Poland (2012), the first Google Fellowship in Parallel Computing (2013), and the ACM/IEEE-CS George Michael HPC Fellowship (2015). More detailed information on a personal site: https://people.inf.ethz.ch/bestam/

Invited Talk: Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent by Karl Bäckström

3 June 2021

Abstract
The emergence of big data in recent years due to the vast societal digitalization and large-scale sensor deployment has entailed significant interest in machine learning methods to enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure Stochastic gradient descent (SGD), is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset to converge to a solution of sufficient quality. In order to cope with increasing data volumes, and to facilitate accelerated processing utilizing contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale due to less coordination, and subsequently waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due the presence of both stale and inconsistent views of the shared state. In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we aim at exploring the impact of synchronization mechanisms, in particular consistency-preserving ones, and the overall effect on the convergence properties. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods, together with established baselines, focusing on the prominent application of Deep Learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in converge time for Leashed-SGD and MindTheStep-AsyncSGD.


Bio:

Karl Bäckström is a Ph.D. student at the Distributed Computing and Systems group at Chalmers University of Technology in Sweden. Karl has an academic background in Mathematics, Computer Science, and Engineering physics, with an overarching interest in distributed and parallel computation, optimization, and machine learning. Karl’s research directions include adaptiveness, synchronization, and consistency in parallel algorithms for iterative optimization. At the 35th IEEE International Parallel and Distributed Processing Symposium, Karl with co-authors were awarded Best Paper Honorable Mention for the paper “Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence”. Karl lives in Gothenburg, a coastal city in western Sweden, together with his Swiss Shepherd Valdi, often enjoying their free time together in nature and wilderness, or at home playing the Piano.

Invited Talk – Efficient Parallel Graph Processing on GPU using Approximate Computing By Somesh Singh

Abstract
Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing unit (GPU) is used to execute graph algorithms. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. We attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output, in exchange for obtaining the result faster. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds a controlled replica of nodes to improve coalescing. The second reduces memory latency by facilitating the processing of subgraphs inside shared memory by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each technique offers notable performance benefits and provides a knob to control inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of large graphs with varied characteristics and five popular graph algorithms.


Bio
Somesh Singh is a Ph.D. candidate in the Dept. of CSE at the Indian Institute of Technology Madras, India. His research interests span the areas of high-performance computing, parallel computing, and graph analytics. His dissertation research focuses on designing techniques for improving the efficiency of parallel graph analytics on graphics processing unit (GPU) by trading-off computational accuracy. His PhD works have been accepted for publication at ICPP 2020, PPoPP 2019 (poster) and TMSCS 2018. He was a research intern at Intel Labs, India  and Microsoft Research, India in 2020. He was a Google Summer of Code participant with CERN-HSF in 2017 and 2018.

Invited Talk – Parallel Graph Learning and Computational Biology Through Sparse Matrices by Dr Aydın Buluç

Dr Aydın Buluç
29 April 2021

Solving systems of linear equations have traditionally driven the research in sparse matrix computation for decades. Direct and iterative solvers, together with finite element computations, still account for the primary use case for sparse matrix data structures and algorithms. These sparse “solvers” often serve as the workhorse of many algorithms in spectral graph theory and traditional machine learning. 

In this talk, I will be highlighting two of the emerging use cases of sparse matrices outside the domain of solvers: graph representation learning methods such as graph neural networks (GNNs) and graph kernels, and computational biology problems such as de novo genome assembly and protein family detection. A recurring theme in these novel use cases is the concept of a semiring on which the sparse matrix computations are carried out. By overloading scalar addition and multiplication operators of a semiring, we can attack a much richer set of computational problems using the same sparse data structures and algorithms. This approach has been formalized by the GraphBLAS effort. I will illustrate one example application from each problem domain, together with the most computationally demanding sparse matrix primitive required for its efficient execution. I will also cover novel parallel algorithms for these sparse matrix primitives and available software that implement them efficiently on various architectures.


Aydın Buluç is a Staff Scientist and Principal Investigator at the Lawrence Berkeley National Laboratory (LBNL) and an Adjunct Assistant Professor of EECS at UC Berkeley. His research interests include parallel computing, combinatorial scientific computing, high performance graph analysis and machine learning, sparse matrix computations, and computational biology. Previously, he was a Luis W. Alvarez postdoctoral fellow at LBNL and a visiting scientist at the Simons Institute for the Theory of Computing. He received his PhD in Computer Science from the University of California, Santa Barbara in 2010 and his BS in Computer Science and Engineering from Sabanci University, Turkey in 2005. Dr. Buluç is a recipient of the DOE Early Career Award in 2013 and the IEEE TCSC Award for Excellence for Early Career Researchers in 2015. He is also a founding associate editor of the ACM Transactions on Parallel Computing. As a graduate student, he spent a semester at the Mathematics Department of MIT, and a summer at the CSRI institute of Sandia National Labs, in New Mexico. He is a member of the SIAM and the ACM.