ParaGrapher: A Parallel and Distributed Graph Loading Library for Large-Scale Compressed Graphs – BigData’25 (Short Paper)


DOI:

While the literature describes an ever-growing number of graph algorithms, loading graphs remains a time-consuming component of end-to-end execution time. Graph frameworks often rely on custom graph storage formats that are not optimized for efficient loading of large-scale graph datasets. Furthermore, graph loading itself often goes unoptimized, as doing so is time-consuming to implement.

This shows a demand for high-performance libraries capable of efficiently loading graphs in order to (i) accelerate the design of new graph algorithms, (ii) evaluate contributions across a wide range of graph datasets, and (iii) facilitate easy and fast comparisons across different graph frameworks.

We present ParaGrapher, a library for loading large-scale compressed graphs in parallel and distributed graph frameworks. ParaGrapher supports (a) synchronous loading, where the caller blocks until the graph is in memory, and (b) asynchronous loading, which interleaves graph loading with graph processing. ParaGrapher is designed to support loading graphs in shared-memory, distributed-memory, and out-of-core graph processing.
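
The asynchronous mode is easiest to picture as a callback invoked per loaded block of edges, so processing of one block overlaps with loading of the next. The sketch below is a minimal stand-in using hypothetical names (fake_load_interleaved, block_callback_t); it is not the actual ParaGrapher API, which is documented below.

/* Illustrative sketch of callback-driven (interleaved) loading.
   All names here are hypothetical placeholders, not ParaGrapher's API. */
#include <stdio.h>

typedef void (*block_callback_t)(unsigned long block_id, void *user);

/* Stand-in loader: hands each "block" of edges to the callback as soon as
   it is ready, so the caller can process block k while block k+1 loads. */
static void fake_load_interleaved(unsigned long nb, block_callback_t cb,
                                  void *user)
{
    for (unsigned long b = 0; b < nb; b++)
        cb(b, user); /* a real loader overlaps this with I/O and decompression */
}

static void process_block(unsigned long block_id, void *user)
{
    unsigned long *blocks_done = user;
    (*blocks_done)++; /* stand-in for per-block graph processing */
    printf("processed block %lu\n", block_id);
}

int main(void)
{
    unsigned long done = 0;
    fake_load_interleaved(4, process_block, &done);
    printf("blocks processed: %lu\n", done);
    return 0;
}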

We explain the design of ParaGrapher and present a performance model of graph decompression. Our evaluation shows that ParaGrapher delivers up to 3.2 times speedup in loading and up to 5.2 times speedup in end-to-end execution (i.e., through interleaved loading and execution).

Source Code

https://github.com/DIPSA-QUB/ParaGrapher

API Documentation

Please refer to the Wiki, https://github.com/DIPSA-QUB/ParaGrapher/wiki/API-Documentation, or download the PDF file from https://github.com/DIPSA-QUB/ParaGrapher/raw/main/doc/api.pdf.

BibTex

@article{paragrapher-bigdata,

}


OrbitSI is now on PyPI

OrbitSI is an open-source Python framework designed to efficiently solve the subgraph isomorphism enumeration problem, i.e., identifying all subgraphs within a data graph that are structurally identical to a given pattern graph. The tool introduces an orbit-aware pruning and ordering strategy that significantly improves enumeration speed compared to classical algorithms. OrbitSI enhances computational performance by integrating structural information about node roles, referred to as orbits, to prune the search space before enumeration. It is built atop NetworkX and C++ backends such as EVOKE and ORCA. The framework supports both command-line and Python interfaces, enabling researchers and practitioners to perform subgraph searches and orbit counting tasks with ease. It is distributed under the Apache 2.0 License.

Check out OrbitSI on Github: https://github.com/ibtisamtauhidi/OrbitSI

Install through PyPI: https://pypi.org/project/orbitsi/

Wasp: Efficient Asynchronous Single-Source Shortest Path on Multicore Systems via Work Stealing – SC’25

We are happy to announce that Marco’s paper was accepted at Supercomputing 2025, taking place 16–21 November in St. Louis. Marco will present his research paper on Thursday the 20th at 3:30 PM in the “Algorithms: Matching System Capabilities” session.

Abstract

The Single-Source Shortest Path (SSSP) problem is a fundamental graph problem with an extensive set of real-world applications. State-of-the-art parallel algorithms for SSSP, such as the ∆-stepping algorithm, create parallelism through priority coarsening. Priority coarsening results in redundant computations that diminish the benefits of parallelization and limit parallel scalability.

This paper introduces Wasp, a novel SSSP algorithm that reduces parallelism-induced redundant work by utilizing asynchrony and an efficient priority-aware work stealing scheme. Contrary to previous work, Wasp introduces redundant computations only when threads have no high-priority work locally available to execute. This is achieved by a novel priority-aware work stealing mechanism that controls the inefficiencies of indiscriminate priority coarsening.
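
As a concrete picture of the steal policy, the sketch below shows one way a priority-aware stealing decision can be structured, assuming per-thread bucketed worklists; it illustrates the idea described above rather than Wasp's actual implementation.

/* Minimal sketch of a priority-aware steal policy over per-thread
   priority buckets; illustrative only, not Wasp's implementation. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_BUCKETS 64

typedef struct {
    size_t size[NUM_BUCKETS]; /* vertices queued per priority bucket */
} worklist_t;

/* Highest-priority (lowest-index) non-empty local bucket, or -1 if none. */
static int best_local_bucket(const worklist_t *wl)
{
    for (int b = 0; b < NUM_BUCKETS; b++)
        if (wl->size[b] > 0)
            return b;
    return -1;
}

/* A thread resorts to stealing only when it holds no work at least as
   urgent as the globally most urgent bucket; this confines redundant
   relaxations to threads that would otherwise sit idle. */
static bool should_steal(const worklist_t *mine, int global_best_bucket)
{
    int local = best_local_bucket(mine);
    return local == -1 || local > global_best_bucket;
}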

Experimental evaluation shows competitive or better performance compared to GAP, GBBS, MultiQueues, Galois, ∆*-stepping, and ρ-stepping on 13 diverse graphs with geometric mean speedups of 2.2x on AMD Zen 3 and 2.1x on Intel Sapphire Rapids using 128 threads.

Source Code

The source code is available at: https://github.com/DIPSA-QUB/wasp
The paper is available at: https://dl.acm.org/doi/10.1145/3712285.3759872

Best Poster Award at IPDPS 2025

We’re excited to share that Marco has been awarded the Best Poster Award at the IPDPS 2025 PhD Forum! Marco’s winning poster, titled “Towards Efficient Asynchronous Single-Source Shortest Path”, presents Wasp, a novel algorithm that tackles the fundamental Single-Source Shortest Path problem. His approach addresses a key challenge in parallel graph algorithms by introducing on-demand priority relaxation through a priority-based work-stealing mechanism.

Congratulations, Marco, on this well-deserved recognition! We’re proud to have you as part of our research team and look forward to seeing your continued contributions to the field.

DIPSA at IPDPS’25

Two of our papers were accepted at IPDPS’25.

Brian will present his work on improving the scalability of parallel molecular dynamics simulation. He has developed a novel way to reduce the scalability bottleneck that exists in the communication between the processes computing short-range forces and those computing long-range forces. His technique discards data dependences when long-range processes are “too slow” and uses interpolation of the (slowly varying) long-range forces to progress the computation. Stay tuned for the camera-ready copy of the paper! This work was supported by the EPSRC New Horizons project ASCCED (EP/X01794X/1).

Hans will present a parallel algorithm for the maximum clique problem. The key ideas relate to reducing the amount of work where possible, which includes delaying or avoiding the construction of fast representations of neighbour lists, early-exiting set intersection operations (sketched below), and algorithmic choice between maximum clique search and the complementary minimum vertex cover problem.
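
To illustrate the early-exit idea: when clique search only needs to know whether a candidate set is still large enough, the intersection can stop as soon as the remaining elements cannot reach the required bound. A minimal sketch, assuming sorted neighbour lists; it is not the paper's code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true iff |A ∩ B| reaches `need`; exits early once the outcome
   is decided, instead of scanning both lists to the end. */
bool intersection_reaches(const uint32_t *a, size_t na,
                          const uint32_t *b, size_t nb, size_t need)
{
    size_t i = 0, j = 0, found = 0;
    while (i < na && j < nb) {
        size_t remaining = (na - i < nb - j) ? na - i : nb - j;
        if (found + remaining < need)
            return false; /* early exit: bound is already unreachable */
        if (a[i] == b[j]) {
            if (++found >= need)
                return true; /* early exit: bound already met */
            i++; j++;
        } else if (a[i] < b[j]) {
            i++;
        } else {
            j++;
        }
    }
    return found >= need;
}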

Additionally, Marco will attend IPDPS’25 by virtue of a travel grant from the TCHPC/TCPP HPC student cohort programme.

On Optimizing Locality of Graph Transposition on Modern Architectures

DOI: 10.48550/arXiv.2501.06872
PDF version

This paper investigates the shared-memory Graph Transposition (GT) problem, a fundamental graph algorithm that is widely used in graph analytics and scientific computing.


Previous GT algorithms have significant memory requirements, proportional to the number of vertices and threads, which obstructs their use on large graphs. Moreover, atomic memory operations have become comparatively fast on recent CPU architectures, which creates new opportunities for improving the performance of concurrent atomic accesses in GT.

We design PoTra, a GT algorithm that leverages the graph structure and the processor and memory architecture to optimize locality and performance. PoTra limits the size of additional data structures to close to CPU cache sizes and utilizes the skewed degree distribution of graph datasets to optimize locality and performance. We present a performance model of PoTra to explain the connection between cache and memory response times and graph locality.
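
For reference, the sketch below shows the classic atomic count-prefix-scatter transposition that fast atomics make viable; PoTra's cache-size-bounded buffering and degree-skew optimizations are deliberately omitted, so this is a baseline illustration rather than PoTra itself.

#include <stdatomic.h>
#include <stdlib.h>

/* in:  offsets[0..nv], edges[0..ne)   (CSR of G)
   out: toff[0..nv],    tedges[0..ne)  (CSR of the transpose), caller-allocated */
void transpose_atomic(unsigned nv, unsigned long ne,
                      const unsigned long *offsets, const unsigned *edges,
                      unsigned long *toff, unsigned *tedges)
{
    atomic_ulong *pos = calloc(nv, sizeof *pos);

    /* 1. count in-degrees; atomics keep the parallel count correct */
    #pragma omp parallel for
    for (unsigned long e = 0; e < ne; e++)
        atomic_fetch_add(&pos[edges[e]], 1);

    /* 2. exclusive prefix sum of in-degrees gives the transposed offsets */
    toff[0] = 0;
    for (unsigned v = 0; v < nv; v++)
        toff[v + 1] = toff[v] + atomic_load(&pos[v]);

    /* 3. scatter, reusing pos[] as per-vertex write cursors */
    for (unsigned v = 0; v < nv; v++)
        atomic_store(&pos[v], toff[v]);
    #pragma omp parallel for
    for (unsigned u = 0; u < nv; u++)
        for (unsigned long e = offsets[u]; e < offsets[u + 1]; e++)
            tedges[atomic_fetch_add(&pos[edges[e]], 1)] = u;

    free(pos);
}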


Our evaluation of PoTra on three CPU architectures and 20 real-world and synthetic graph datasets with up to 128 billion edges demonstrates that PoTra achieves up to 8.7 times speedup compared to previous work; in the cases where it is slower, the performance loss is limited to 15.7% on average.

Source Code

The source code of PoTra is available in the LaganLighter repository.

BibTex

@misc{PoTra,
     title={On Optimizing Locality of Graph Transposition on Modern Architectures}, 
     author={Mohsen {Koohi Esfahani} and Hans Vandierendonck},
     year={2025},
     eprint={2501.06872},
     archivePrefix={arXiv},
     primaryClass={cs.DC},
     url={https://arxiv.org/abs/2501.06872},
     doi={10.48550/arXiv.2501.06872} 
} 



Accelerating Loading WebGraphs in ParaGrapher

PDF version
DOI: 10.48550/arXiv.2507.00716

ParaGrapher is a graph loading API and library that enables graph processing frameworks to load large-scale compressed graphs with minimal overhead. This capability accelerates the design and implementation of new high-performance graph algorithms and their evaluation on a wide range of graphs and across different frameworks.

However, our previous study identified two major limitations in ParaGrapher: inefficient utilization of high-bandwidth storage and reduced decompression bandwidth due to increased compression ratios. To address these limitations, we present two optimizations for ParaGrapher in this paper.

To improve storage utilization, particularly for high-bandwidth storage, we introduce ParaGrapher-FUSE (PG-Fuse), a filesystem based on FUSE (Filesystem in Userspace). PG-Fuse optimizes storage access by increasing the size of requested blocks, reducing the number of calls to the underlying filesystem, and caching the received blocks in memory for future calls.
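
The core mechanism is easy to sketch: small reads are served from a large cached block, so each call to the underlying storage fetches far more than was asked for. Below is a minimal single-block illustration, assuming a 16 MB request size; it is not PG-Fuse's implementation.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE (16L * 1024 * 1024) /* assumed large request size */

typedef struct {
    int fd;             /* underlying file */
    off_t cached_start; /* offset of the cached block; -1 when empty */
    char *block;        /* BLOCK_SIZE bytes */
} coalescing_reader_t;

/* Serve a small read from the cache, refilling it with one large pread()
   when the requested offset falls outside the cached block. */
ssize_t coalesced_read(coalescing_reader_t *r, char *dst, size_t len, off_t off)
{
    off_t start = (off / BLOCK_SIZE) * BLOCK_SIZE;
    if (r->cached_start != start) {
        if (pread(r->fd, r->block, BLOCK_SIZE, start) < 0)
            return -1;
        r->cached_start = start;
    }
    size_t in_block = (size_t)(off - start);
    if (len > BLOCK_SIZE - in_block)
        len = BLOCK_SIZE - in_block; /* caller re-issues for the rest */
    memcpy(dst, r->block + in_block, len);
    return (ssize_t)len;
}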

To improve the decompression bandwidth, we introduce CompBin, a compact binary representation of the CSR format. CompBin facilitates direct access to neighbor lists while avoiding storage of unused bytes.
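
A compact binary CSR can be pictured as a small header followed by fixed-width offsets and edges arrays, so a neighbor list is a direct slice of the mapped file. The field names and widths below are assumptions for illustration, not the actual CompBin format.

#include <stdint.h>

typedef struct {
    uint64_t vertices_count;
    uint64_t edges_count;
    /* followed in the file by:
       uint64_t offsets[vertices_count + 1];
       uint32_t edges[edges_count];            */
} compbin_header_t;

/* With the file mmap'ed, the neighbors of v are read in place: no
   decompression, no per-vertex copies. */
static inline const uint32_t *neighbors(const void *file, uint64_t v,
                                        uint64_t *degree)
{
    const compbin_header_t *h = file;
    const uint64_t *offsets = (const uint64_t *)(h + 1);
    const uint32_t *edges = (const uint32_t *)(offsets + h->vertices_count + 1);
    *degree = offsets[v + 1] - offsets[v];
    return edges + offsets[v];
}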

Our evaluation on 12 real-world and synthetic graphs with up to 128 billion edges shows that PG-Fuse and CompBin achieve up to 7.6 and 21.8 times speedup, respectively.

Source Code

https://github.com/DIPSA-QUB/ParaGrapher

API Documentation

Please refer to the Wiki, https://github.com/DIPSA-QUB/ParaGrapher/wiki/API-Documentation, or download the PDF file from https://github.com/DIPSA-QUB/ParaGrapher/raw/main/doc/api.pdf.

BibTex

@misc{pg_fuse,
      title={Accelerating Loading WebGraphs in ParaGrapher}, 
      author={Mohsen {Koohi Esfahani}},
      year={2025},
      eprint={2507.00716},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2507.00716}, 
}


Random Vertex Relabelling in LaganLighter

To evaluate the impacts of locality-optimizing reordering algorithms, a baseline is required. To create this baseline, a random assignment of IDs to vertices may be used to produce a representation of the graph with reduced locality [DOI:10.1109/ISPASS57527.2023.00029, DOI:10.1109/IISWC53511.2021.00020].

To that end, we implement the random_ordering() function in the relabel.c file. It consists of a number of iterations; in each iteration, concurrent threads traverse the list of vertices and assign them new IDs. The function uses the xoshiro random number generator.
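
The sketch below shows one way such an iteration can be structured, assuming each thread applies a Fisher–Yates shuffle to its own slice of the ID array with a thread-local xoshiro256** stream, and a per-iteration shift rotates the slice boundaries so IDs mix across threads over successive iterations. The actual random_ordering() in relabel.c may differ in these details.

#include <stdint.h>
#include <omp.h>

static inline uint64_t rotl64(uint64_t x, int k) { return (x << k) | (x >> (64 - k)); }

/* xoshiro256** step; the state must be seeded non-zero (real code would
   seed each thread's state with splitmix64) */
static uint64_t xoshiro_next(uint64_t s[4])
{
    uint64_t r = rotl64(s[1] * 5, 7) * 9, t = s[1] << 17;
    s[2] ^= s[0]; s[3] ^= s[1]; s[1] ^= s[2]; s[0] ^= s[3];
    s[2] ^= t; s[3] = rotl64(s[3], 45);
    return r;
}

/* One iteration: each thread shuffles one slice of ids[]; the shifted
   slices are pairwise disjoint, so no synchronization is needed. */
void random_ordering_iteration(uint32_t *ids, uint32_t nv, uint32_t shift)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        uint64_t s[4] = { 0x9E3779B97F4A7C15ULL + t, 0xBF58476D1CE4E5B9ULL,
                          0x94D049BB133111EBULL, 1 };
        uint32_t lo = (uint32_t)((uint64_t)nv * t / nt);
        uint32_t hi = (uint32_t)((uint64_t)nv * (t + 1) / nt);
        for (uint32_t i = hi; i > lo + 1; i--) { /* Fisher–Yates on [lo, hi) */
            uint32_t j = lo + (uint32_t)(xoshiro_next(s) % (i - lo));
            uint32_t a = (i - 1 + shift) % nv, b = (j + shift) % nv;
            uint32_t tmp = ids[a]; ids[a] = ids[b]; ids[b] = tmp;
        }
    }
}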

The alg4_randomize test applies this function to a number of graphs. For each dataset, an initial plot of the distribution of Neighbor to Neighbor Average ID Distance (N2N AID) [DOI:10.1109/IISWC53511.2021.00020] is created; after each iteration of random_ordering(), the N2N AID distribution is plotted again. This shows the impacts of randomization.

The complete results for all graphs can be seen in this PDF file; results for a selection of graphs follow.

The algorithm was executed on a machine with two AMD 7401 CPUs (128 cores, 128 threads). The report created by the launcher follows.


Minimum Spanning Forest of MS-BioGraphs

We use MASTIFF to compute the weight of the Minimum Spanning Forest (MSF) of MS-BioGraphs, ignoring the self-edges of the graphs.

– MS1

Using a machine with 24 cores.

MSF weight: 109,915,787,546

– MS50

Using a machine with 128 cores.

MSF weight: 416,318,200,808


Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP

In applications such as graph processing, how threads are pinned to CPU cores is important: threads that share resources (such as memory and cache) can accelerate performance by processing consecutive blocks of the input dataset, especially when the dataset has a high level of locality.

In LaganLighter, we read the CPU topology to decide how OpenMP threads are pinned. In the omp.c file, the block starting with the comment “Reading sibling groups of each node” reads the /sys/devices/system/cpu/cpu*/topology/thread_siblings files to identify the sibling threads, and three arrays (node_sibling_groups_start_ID, sibling_group_cpus_start_offsets, and sibling_groups_cpus) are used to store the sibling CPUs.

Then, in the block starting with the comment “Setting affinity of threads”, the sibling groups are read and, based on the total number of threads requested by the user, threads with consecutive IDs are pinned to sibling CPUs. A condensed sketch of the mechanism follows.
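
The sketch keeps only the core mechanism: pin each OpenMP thread with sched_setaffinity() and consult the per-CPU topology files (it reads the human-readable thread_siblings_list variant of the thread_siblings file). The identity thread-to-CPU mapping is a simplification; omp.c derives the mapping from the sibling-group arrays described above.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

/* Pin the calling thread to a single CPU (pid 0 = calling thread). */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof set, &set);
}

int main(void)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        pin_to_cpu(t); /* simplification: thread t -> cpu t */

        /* Show the hyperthread siblings of the CPU this thread runs on. */
        char path[128], sib[64];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", t);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fgets(sib, sizeof sib, f))
                printf("thread %d pinned to cpu %d, siblings: %s", t, t, sib);
            fclose(f);
        }
    }
    return 0;
}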

For a machine with 24 cores and 48 hyperthreads, when 48 threads are requested, we have:

If 96 threads are created, we have:
