Invited Talk – Efficient Computation through Tuned Approximation by David Keyes

21 February 2024

Abstract

Numerical software is being reinvented to provide opportunities to tune dynamically the accuracy of computation to the requirements of the application, resulting in savings of memory, time, and energy. Floating-point computation in science and engineering has a history of “oversolving” relative to the accuracy requirements of many models. Real datatypes are so often defaulted to double precision that GPUs did not gain wide acceptance in scientific computing until they provided double-precision arithmetic in hardware, an operation not required in their original domain of graphics. Computational science is now reverting to lower-precision arithmetic where possible. Many matrix operations can be carried out in lower precision at a blockwise level without loss of accuracy, adapting to the magnitude of the norm of each block. Furthermore, many blocks can be approximated by low-rank near-equivalents to a prescribed accuracy, adapting to the smoothness of the coefficients of the block. This leads to a smaller memory footprint, which implies higher residency in the memory hierarchy, leading in turn to less time and energy spent on data copying, savings that may even dwarf those from fewer and cheaper flops. We provide examples from several application domains, including Gordon Bell Prize-nominated research in 2022 in environmental statistics and in 2023 in seismic processing.
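
The blockwise low-rank idea can be sketched in a few lines of numpy: approximate a dense block by a truncated SVD whose rank adapts to a prescribed tolerance. This is only a minimal illustration (the kernel matrix and tolerance are invented for the example), not the tile-low-rank software behind the results described in the talk.

```python
import numpy as np

def compress_block(block, tol=1e-6):
    """Low-rank approximation of a dense block to a prescribed relative accuracy.

    Returns factors (U, V) with block ~= U @ V; the retained rank adapts to
    how numerically low-rank (smooth) the block is."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))      # rank needed for this block
    return U[:, :k] * s[:k], Vt[:k, :]           # (m x k) and (k x n) factors

# Example: a block sampled from a smooth kernel compresses to very low rank,
# shrinking its memory footprint from m*n to k*(m + n) entries.
x = np.linspace(0.0, 1.0, 256)
block = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
Uk, Vk = compress_block(block, tol=1e-8)
print("retained rank:", Uk.shape[1])
print("max entrywise error:", np.abs(block - Uk @ Vk).max())
```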

Bio

David Keyes directs the Extreme Computing Research Center at the King Abdullah University of Science and Technology (KAUST), where he was a founding Dean in 2009 and currently serves in the Office of the President. He is a professor in the programs of Applied Mathematics, Computer Science, and Mechanical Engineering. He is also an Adjunct Professor of Applied Mathematics and Applied Physics at Columbia University, where he formerly held the Fu Foundation Chair. He works at the interface between parallel computing and PDEs and statistics, with a focus on scalable algorithms that exploit data sparsity. Before joining KAUST, Keyes led multi-institutional scalable solver software projects in the SciDAC and ASCI programs of the US Department of Energy (DoE), ran university collaboration programs at US DoE and NASA institutes, and taught at Columbia, Old Dominion, and Yale Universities. He is a Fellow of SIAM, the AMS, and the AAAS. He has been awarded the Gordon Bell Prize from the ACM, the Sidney Fernbach Award from the IEEE Computer Society, and the SIAM Prize for Distinguished Service to the Profession. He earned a B.S.E. in Aerospace and Mechanical Sciences from Princeton in 1978 and a Ph.D. in Applied Mathematics from Harvard in 1984.

Invited Talk – Fine-Grained and Phase-Aware Frequency Scaling for Energy-efficient Computing on Heterogeneous Multi-GPU Systems by Lorenzo Carpentieri

9 May 2025

Abstract

As computing power demands continue to grow, achieving energy efficiency in high-performance systems has become a key challenge. One of the most promising software techniques for energy efficiency is Dynamic Voltage and Frequency Scaling (DVFS), which optimizes the energy-performance trade-off by changing hardware frequencies.

This presentation introduces two complementary approaches that advance the state-of-the-art in energy-efficient heterogeneous computing through fine-grained and phase-aware frequency tuning.

The first approach, SYnergy, leverages a novel compiler- and runtime-integrated methodology built upon the SYCL programming model to enable fine-grained frequency scaling on heterogeneous hardware. SYnergy allows developers to specify an energy goal for each individual kernel, such as minimizing the Energy-Delay Product (EDP) or achieving a predefined energy-performance trade-off. Through compiler integration and a machine learning model, the frequency of each kernel is statically optimized for the specified energy goal. To extend this fine-grained control to large-scale systems, SYnergy includes a custom SLURM plugin that enables execution across all available devices in a cluster, ensuring scalable energy savings.
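
As a concrete illustration of a per-kernel energy goal, the sketch below picks, for each kernel, the frequency that minimizes the Energy-Delay Product from a table of measured (or predicted) energy and runtime values. This is a minimal Python sketch with hypothetical numbers, not SYnergy's compiler and runtime machinery.

```python
def energy_delay_product(energy_j, time_s):
    """EDP = energy * delay; lower is better."""
    return energy_j * time_s

def best_frequency(profile, objective=energy_delay_product):
    """Pick the core frequency that minimizes the objective for one kernel.

    `profile` maps frequency (MHz) -> (energy in J, runtime in s),
    obtained from measurements or a predictive model."""
    return min(profile, key=lambda f: objective(*profile[f]))

# Hypothetical per-kernel profiles (frequency -> (energy, time)).
kernels = {
    "stencil":   {1410: (12.0, 0.80), 1200: (9.5, 0.92), 990: (8.1, 1.15)},
    "reduction": {1410: (3.0, 0.10), 1200: (2.9, 0.11), 990: (3.1, 0.16)},
}
for name, profile in kernels.items():
    print(name, "->", best_frequency(profile), "MHz")
```

With these illustrative numbers, the memory-bound kernel prefers a lower frequency while the compute-bound one stays at the peak, which is exactly the per-kernel behaviour fine-grained scaling is after.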

While fine-grained frequency scaling at the kernel level can significantly improve energy efficiency, it also introduces overhead due to frequent frequency changes, an overhead that can, in some cases, outweigh the potential benefits. To address this, we propose a novel phase-aware method that detects distinct phases through application profiling and DAG analysis and sets an optimal frequency for each phase. Our methodology also handles MPI programs, where the overhead can be hidden by overlapping frequency changes with communication.
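
One simple way to picture phase formation is sketched below: walk the kernels in execution order and open a new phase only when the estimated benefit of switching frequency exceeds the switching overhead. The greedy rule and the constants are purely illustrative assumptions, not the profiling- and DAG-based analysis used by the actual method.

```python
def group_into_phases(kernel_freqs, switch_overhead_s, est_gain_s_per_mhz=1e-5):
    """Greedy phase formation: start a new phase only when changing the
    frequency is estimated to save more time than the switch itself costs."""
    phases = []  # list of (phase frequency, [kernel names])
    for name, freq in kernel_freqs:
        if phases and abs(freq - phases[-1][0]) * est_gain_s_per_mhz < switch_overhead_s:
            phases[-1][1].append(name)      # not worth switching: absorb kernel
        else:
            phases.append((freq, [name]))   # switching pays off: open a new phase
    return phases

# Kernels in execution (topological) order with their preferred frequencies (MHz).
order = [("k0", 1410), ("k1", 1395), ("k2", 990), ("k3", 1005)]
print(group_into_phases(order, switch_overhead_s=0.001))
# -> [(1410, ['k0', 'k1']), (990, ['k2', 'k3'])]
```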

Bio

Lorenzo Carpentieri received his master’s degree from the University of Salerno, Italy, in 2022. He is now a PhD student in the Department of Computer Science at the University of Salerno, under the supervision of Prof. Biagio Cosenza. His research interests include high-performance computing, compiler technology, and programming models, with a particular interest in energy-efficient and approximate computing.

Invited Talk – A Constraint Programming Solver You Can Trust (But Don’t Have To) by Ciaran McCreesh

28 August 2025

Abstract

Constraint programming is a declarative way of solving hard combinatorial, scheduling, resource allocation, and logistics problems. We specify a problem in a high-level language, give it to a solver, and the solver thinks for a while and then gives us the optimal answer. Unfortunately, even the best commercial and academic solvers contain bugs, and will occasionally give a wrong answer, potentially with devastating effects. One way of avoiding this situation is through proof logging, where solvers are modified to output a mathematical proof of correctness alongside their solution. This proof can then be independently audited by a very simple (and potentially even formally verified) proof checking tool, giving us complete confidence in the correctness of solutions (although not the solvers themselves). I’ll explain how proof logging works in general, and give an overview of the challenges and fun involved in bringing it to constraint programming. Ultimately, the aim here is to make algorithms something people can trust with their lives and livelihoods, just as engineers have already done with bridges, planes, and lifts.
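
To make the auditing idea concrete, here is a toy independent checker in Python: given the variables, their domains, the constraints, and a solver's claimed assignment, it verifies the claim from scratch. Checking a claimed solution is the easy half; the proof logging described in the talk additionally certifies optimality or unsatisfiability, which this sketch does not attempt. The model and constraints below are invented for illustration.

```python
def check_solution(variables, constraints, assignment):
    """Independently audit a solver's claimed solution: every variable must
    get a value from its domain and every constraint must be satisfied."""
    if set(assignment) != set(variables):
        return False
    if any(assignment[v] not in domain for v, domain in variables.items()):
        return False
    return all(constraint(assignment) for constraint in constraints)

# Toy model: x, y, z in {1, 2, 3}, all different, and x + y == z.
variables = {"x": range(1, 4), "y": range(1, 4), "z": range(1, 4)}
constraints = [
    lambda a: len({a["x"], a["y"], a["z"]}) == 3,   # all-different
    lambda a: a["x"] + a["y"] == a["z"],
]
print(check_solution(variables, constraints, {"x": 1, "y": 2, "z": 3}))  # True
print(check_solution(variables, constraints, {"x": 2, "y": 2, "z": 3}))  # False
```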

Bio

Ciaran McCreesh is a Royal Academy of Engineering Research Fellow working in the Formal Analysis, Theory and Algorithms group in the School of Computing Science at the University of Glasgow. His research looks at practical parallel algorithms, particularly in relation to hard subgraph problems. His publications cover combinatorial search, parallel algorithms, and constraint programming.

Parallel Computing Training Session 2023

Date: 6th March 2023
Location: CSB Training Room 02.017

Parallel computing is a key technology supporting high-performance computing (HPC) and data analytics. The goal of this module is to provide an overview of parallel computing and introduce attendees to prevailing programming models. The expected outcome of this module is that participants will have an understanding of the different languages and approaches used in this domain and that they will be able to construct simple parallel programs using the discussed languages.

Theoretical lectures and practical training will be delivered by Hans Vandierendonck, Romain Garnier, Syed Ibtisam Tauhidi, Amir Sabbagh Molahosseini, and Zohreh Moradinia. A general overview of high-performance computing will be presented in the morning by Romain Garnier and Amir Sabbagh Molahosseini, followed by lab classes working through practical examples with OpenMP. In the afternoon, Syed Ibtisam Tauhidi and Zohreh Moradinia will present how to efficiently handle data analytics with Hadoop and Spark.

Invited Talk – Enabling high-performance large-scale irregular computations by Maciej Besta

17 June 2021

Abstract:
Large graphs are behind many problems in today’s computing landscape. The growing sizes of such graphs, recently reaching 70 trillion edges, require unprecedented amounts of compute power, storage, and energy. In this talk, we illustrate how to process such extreme-scale graphs effectively. We will first discuss Slim Graph, the first approach for practical lossy graph compression, which facilitates high-performance approximate graph processing, storage, and analytics with strong theoretical guarantees on accuracy for a broad set of graph problems and algorithms. In the second part of the talk, we will focus on a class of complex graph mining problems such as clique listing. Here, we first show how to solve such problems on complex parallel architectures in a simple and high-performance way. For this, we propose a novel set-centric paradigm, where complex algorithms are broken down into simple set-algebra building blocks, such as set intersection or union, which can be separately optimized, executed, and scheduled. Moreover, after discussing how to mine complex graph patterns effectively and efficiently, we will turn our attention to pattern prediction. Specifically, we establish the general problem of motif prediction, in which we generalize link prediction, one of the central problems in graph analytics, to predicting the arrival of arbitrary complex higher-order structures called motifs. To solve motif prediction, we harness recent graph neural network architectures.
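
As a toy illustration of the set-centric paradigm, the Python sketch below counts triangles using nothing but set intersections over neighbour sets; the optimized, parallel kernels discussed in the talk build on the same primitive. The tiny example graph is invented for illustration.

```python
def triangles(adj):
    """Count triangles using only set intersections.

    `adj` maps each vertex to the set of its neighbours. Each triangle
    {u, v, w} is counted once by requiring u < v < w."""
    count = 0
    for u, nbrs_u in adj.items():
        for v in nbrs_u:
            if v > u:
                # Common neighbours of u and v close a triangle.
                count += sum(1 for w in nbrs_u & adj[v] if w > v)
    return count

# A 4-clique contains exactly 4 triangles.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(triangles(adj))  # 4
```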

Bio
Maciej is a researcher in the Scalable Parallel Computing Lab at ETH Zurich. He works on large-scale irregular computations and high-performance networking. He received Best Paper and Best Student Paper awards at ACM/IEEE Supercomputing 2013, 2014, and 2019 and at ACM HPDC 2015 and 2016, an ACM Research Highlights selection in 2018, and several further best paper nominations (ACM HPDC 2014, ACM FPGA 2019, and ACM/IEEE Supercomputing 2019). He won, among others, the competition for the Best Student of Poland (2012), the first Google Fellowship in Parallel Computing (2013), and the ACM/IEEE-CS George Michael HPC Fellowship (2015). More detailed information is available on his personal site: https://people.inf.ethz.ch/bestam/

Invited Talk: Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent by Karl Bäckström

3 June 2021

Abstract
The emergence of big data in recent years, due to vast societal digitalization and large-scale sensor deployment, has generated significant interest in machine learning methods to enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure stochastic gradient descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset to converge to a solution of sufficient quality. In order to cope with increasing data volumes, and to facilitate accelerated processing on contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale, thanks to less coordination and hence less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state.

In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and to develop tools and frameworks that facilitate improved convergence as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant, MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we explore the impact of synchronization mechanisms, in particular consistency-preserving ones, and their overall effect on convergence. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint.

We perform an extensive empirical study benchmarking the proposed methods against established baselines, focusing on the prominent application of deep learning for image classification on the benchmark datasets MNIST and CIFAR, and showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD.
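
As a rough illustration of the staleness-adaptive idea, the Python sketch below runs SGD on a toy least-squares problem in which each update is computed from a deliberately stale copy of the parameters and the step size is damped in proportion to the staleness. The damping rule 1/(1 + staleness), the simulated delays, and the problem itself are illustrative assumptions, not the MindTheStep-AsyncSGD scheme analysed in the work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||A x - b||^2 over x.
A = rng.normal(size=(200, 10))
x_true = rng.normal(size=10)
b = A @ x_true

def minibatch_gradient(x, batch):
    rows = A[batch]
    return 2.0 * rows.T @ (rows @ x - b[batch]) / len(batch)

x = np.zeros(10)
base_lr = 0.05
history = [x.copy()]                      # past parameter versions

for step in range(2000):
    staleness = int(rng.integers(0, 5))   # simulated asynchronous delay (in updates)
    stale_x = history[max(0, len(history) - 1 - staleness)]
    grad = minibatch_gradient(stale_x, rng.integers(0, 200, size=16))
    # Staleness-adaptive step: damp updates computed from older views.
    x = x - (base_lr / (1.0 + staleness)) * grad
    history.append(x.copy())

print("distance to optimum:", np.linalg.norm(x - x_true))
```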


Bio:

Karl Bäckström is a Ph.D. student in the Distributed Computing and Systems group at Chalmers University of Technology in Sweden. Karl has an academic background in Mathematics, Computer Science, and Engineering Physics, with an overarching interest in distributed and parallel computation, optimization, and machine learning. Karl’s research directions include adaptiveness, synchronization, and consistency in parallel algorithms for iterative optimization. At the 35th IEEE International Parallel and Distributed Processing Symposium, Karl and his co-authors were awarded a Best Paper Honorable Mention for the paper “Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence”. Karl lives in Gothenburg, a coastal city in western Sweden, together with his Swiss Shepherd Valdi; they often enjoy their free time together in nature and the wilderness, or at home playing the piano.

Invited Talk – Efficient Parallel Graph Processing on GPU using Approximate Computing by Somesh Singh

Abstract
Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues are exacerbated when a graphics processing unit (GPU) is used to execute graph algorithms. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. We attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output in exchange for obtaining the result faster. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replicas of nodes to improve coalescing. The second reduces memory latency by facilitating the processing of subgraphs inside shared memory: it adds edges among specific nodes and processes well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across the nodes assigned to a warp to reduce thread divergence. Each technique offers notable performance benefits and provides a knob to control the inaccuracy introduced into an execution. We demonstrate the effectiveness of the proposed techniques using a suite of large graphs with varied characteristics and five popular graph algorithms.
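
One basic ingredient of layouts that bring related nodes nearby in memory is vertex relabelling. The Python sketch below relabels vertices in BFS order so that neighbours receive nearby IDs, a generic locality-improving transformation; it is not the paper's isomorph construction, which additionally replicates nodes and accepts controlled inaccuracy. The tiny example graph is invented.

```python
from collections import deque

def bfs_relabel(adj):
    """Relabel vertices in BFS order so that neighbours receive nearby IDs.

    `adj` is a list of neighbour lists indexed by old vertex ID.
    Returns (new_adj, old_to_new). Nearby IDs mean neighbour data tends to
    sit in nearby memory, which favours coalesced accesses on a GPU."""
    n = len(adj)
    old_to_new = [-1] * n
    order = []                                  # old IDs in visit order
    for src in range(n):                        # cover disconnected components
        if old_to_new[src] != -1:
            continue
        old_to_new[src] = len(order)
        order.append(src)
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if old_to_new[v] == -1:
                    old_to_new[v] = len(order)
                    order.append(v)
                    queue.append(v)
    new_adj = [sorted(old_to_new[v] for v in adj[u]) for u in order]
    return new_adj, old_to_new

# Tiny example: vertex IDs are shuffled; BFS ordering restores locality.
adj = [[3], [2, 4], [1], [0, 4], [1, 3]]
print(bfs_relabel(adj))
```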


Bio
Somesh Singh is a Ph.D. candidate in the Department of Computer Science and Engineering at the Indian Institute of Technology Madras, India. His research interests span high-performance computing, parallel computing, and graph analytics. His dissertation research focuses on designing techniques for improving the efficiency of parallel graph analytics on graphics processing units (GPUs) by trading off computational accuracy. His PhD work has been published at ICPP 2020, PPoPP 2019 (poster), and TMSCS 2018. He was a research intern at Intel Labs India and Microsoft Research India in 2020, and a Google Summer of Code participant with CERN-HSF in 2017 and 2018.

Invited Talk – Parallel Graph Learning and Computational Biology Through Sparse Matrices by Dr Aydın Buluç

Dr Aydın Buluç, Lawrence Berkeley National Laboratory
29 April 2021

Abstract

Solving systems of linear equations has traditionally driven research in sparse matrix computation for decades. Direct and iterative solvers, together with finite element computations, still account for the primary use case for sparse matrix data structures and algorithms. These sparse “solvers” often serve as the workhorse of many algorithms in spectral graph theory and traditional machine learning.

In this talk, I will be highlighting two of the emerging use cases of sparse matrices outside the domain of solvers: graph representation learning methods such as graph neural networks (GNNs) and graph kernels, and computational biology problems such as de novo genome assembly and protein family detection. A recurring theme in these novel use cases is the concept of a semiring on which the sparse matrix computations are carried out. By overloading scalar addition and multiplication operators of a semiring, we can attack a much richer set of computational problems using the same sparse data structures and algorithms. This approach has been formalized by the GraphBLAS effort. I will illustrate one example application from each problem domain, together with the most computationally demanding sparse matrix primitive required for its efficient execution. I will also cover novel parallel algorithms for these sparse matrix primitives and available software that implement them efficiently on various architectures.
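
To show what the semiring abstraction buys, the Python sketch below implements a matrix product whose scalar addition and multiplication are passed in as parameters; instantiated with (min, +), the very same routine performs shortest-path relaxation instead of numerical linear algebra. It is a plain-Python toy rather than GraphBLAS, and the example graph is invented.

```python
import math

def semiring_matmul(A, B, add, mul, zero):
    """Matrix product in which scalar (+, *) are replaced by semiring (add, mul)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            if A[i][k] != zero:                      # exploit sparsity
                for j in range(p):
                    C[i][j] = add(C[i][j], mul(A[i][k], B[k][j]))
    return C

INF = math.inf
# Weighted adjacency matrix of a 4-vertex path graph (INF = no edge, 0 on the diagonal).
W = [[0, 1, INF, INF],
     [1, 0, 2, INF],
     [INF, 2, 0, 3],
     [INF, INF, 3, 0]]

# Over the (min, +) "tropical" semiring, repeated products compute shortest paths.
D = W
for _ in range(2):
    D = semiring_matmul(D, W, add=min, mul=lambda a, b: a + b, zero=INF)
print(D[0][3])   # 6: the length of the path 0 -> 1 -> 2 -> 3
```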


Bio

Aydın Buluç is a Staff Scientist and Principal Investigator at Lawrence Berkeley National Laboratory (LBNL) and an Adjunct Assistant Professor of EECS at UC Berkeley. His research interests include parallel computing, combinatorial scientific computing, high-performance graph analysis and machine learning, sparse matrix computations, and computational biology. Previously, he was a Luis W. Alvarez postdoctoral fellow at LBNL and a visiting scientist at the Simons Institute for the Theory of Computing. He received his PhD in Computer Science from the University of California, Santa Barbara in 2010 and his BS in Computer Science and Engineering from Sabanci University, Turkey in 2005. Dr. Buluç is a recipient of the DOE Early Career Award in 2013 and the IEEE TCSC Award for Excellence for Early Career Researchers in 2015. He is also a founding associate editor of the ACM Transactions on Parallel Computing. As a graduate student, he spent a semester at the Mathematics Department of MIT and a summer at the CSRI institute of Sandia National Labs in New Mexico. He is a member of SIAM and the ACM.

Invited Talk: Metaprogramming in Jupyter Notebooks – Dr Jeremy Singer

Dr Jeremy Singer, University of Glasgow
25 February 2021

Abstract:
Love ’em or hate ’em, interactive computational notebooks are here to stay as a mainstream code development medium. In particular, the Jupyter system is widely used by the data science community. This presentation explores some use cases for programmatic introspection of a Jupyter notebook from within a notebook itself. We sketch a possible reflection API for Jupyter and describe how its implementation is complicated by the under-the-hood message flows of the Jupyter distributed system architecture.
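
To give a flavour of what in-notebook introspection looks like today, the sketch below uses the standard IPython shell object available inside any Jupyter code cell to walk the cell history and the user namespace. This relies only on existing IPython functionality; it is not the reflection API proposed in the talk.

```python
# Run inside a Jupyter code cell: the IPython kernel exposes its own state,
# which already allows simple notebook introspection without any extra API.
ip = get_ipython()                      # the live InteractiveShell instance

# Raw source of every cell executed so far in this session.
for index, source in enumerate(ip.history_manager.input_hist_raw):
    if source.strip():
        print(f"cell {index}: {len(source.splitlines())} line(s)")

# User-defined names currently living in the kernel's namespace.
user_names = [name for name in ip.user_ns
              if not name.startswith("_") and name not in ip.user_ns_hidden]
print("user-defined names:", user_names)
```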


Bio:
Jeremy Singer is a senior lecturer in the School of Computing Science at the University of Glasgow, where he has worked for the past 10 years. Jeremy’s research interests include programming language compilers and runtimes, memory management, manycore parallelism, and distributed systems. He currently co-leads the EPSRC-funded Capable VMs project. Jeremy is the author of the textbook “Operating System Foundations with Linux on the Raspberry Pi” and lead educator of the “Functional Programming in Haskell” massive open online course.

Invited Talk: Addressing Practical HPC Problems: Fault Tolerance, Performance Portability, Parallelism Compilation – Dr Giorgis Georgakoudis

Dr Giorgis Georgakoudis, Lawrence Livermore National Laboratory
18 February 2021

Abstract:

This talk will present an overview of research on open problems in several areas of HPC. On fault tolerance, Giorgis will present the Reinit solution for fault tolerance in large-scale MPI applications. Reinit improves the recovery time of checkpointed MPI applications by avoiding MPI re-deployment on restart, instead extending the MPI runtime to repair itself at runtime. On performance portability, Giorgis will present the auto-tuning framework Apollo, which provides an API for tuning the execution parameters of code regions at runtime using machine learning. Lastly, Giorgis will discuss the deficiencies of parallelism compilation and a way forward for improving compiler optimizations for parallel programs.
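
To illustrate the general shape of runtime auto-tuning of a code region, the sketch below implements a tiny epsilon-greedy tuner that measures candidate execution parameters and converges on the fastest one. The class, the chunked-reduction region, and the candidate values are all hypothetical illustrations; this is not Apollo's API or its machine-learning models.

```python
import random
import time

class RegionTuner:
    """Toy epsilon-greedy tuner: prefer the execution parameter with the
    lowest mean measured runtime for this code region, but keep exploring."""

    def __init__(self, candidates, explore=0.2):
        self.candidates = candidates
        self.explore = explore
        self.samples = {c: [] for c in candidates}

    def choose(self):
        untried = [c for c in self.candidates if not self.samples[c]]
        if untried or random.random() < self.explore:
            return random.choice(untried or self.candidates)
        return min(self.candidates,
                   key=lambda c: sum(self.samples[c]) / len(self.samples[c]))

    def report(self, candidate, runtime):
        self.samples[candidate].append(runtime)

# Hypothetical tunable region: a chunked reduction whose best chunk size is unknown.
tuner = RegionTuner(candidates=[1_000, 10_000, 100_000])
data = list(range(1_000_000))
for _ in range(15):
    chunk = tuner.choose()
    start = time.perf_counter()
    total = sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))
    tuner.report(chunk, time.perf_counter() - start)

best = min(tuner.candidates,
           key=lambda c: sum(tuner.samples[c]) / len(tuner.samples[c]))
print("preferred chunk size:", best)
```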

Bio:

Giorgis Georgakoudis is a Computer Scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research interests include fault tolerance, optimized compilation of parallel programs, and runtime performance characterization and tuning. He is currently involved in the Exascale Computing Project, designing and developing fault-tolerance abstractions for MPI, and in the Apollo project for dynamically tuning the performance of parallel programs. Also, since October 2020, Giorgis has led his own project on compiler optimizations for parallelism as Principal Investigator, funded through the Laboratory Directed Research and Development (LDRD) Program of LLNL.

Giorgis obtained his Dipl. Eng. (2007), Master’s (2010), and PhD (2017) degrees from the Department of Electrical and Computer Engineering of the University of Thessaly, Greece. From 2013 to 2018, he was also affiliated with Queen’s University Belfast, UK, working as a researcher concurrently with his PhD studies. Since November 2018, Giorgis has been working in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. He is a member of the ACM and IEEE, and frequently provides professional service as a reviewer for conferences and journals.