MS-BioGraphs: Trillion-Scale Public Real-World Sequence Similarity Graph Datasets – DIPSA: Data-Intensive Parallel Systems and Algorithms

Project Statement

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly accessible, relevant, and realistic datasets.
We (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available real-world graph datasets with up to 2.5 trillion edges, which is 6.6 times larger than the largest graph published recently.
The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types.

Project Steps

(i) Creating and validating the MS-BioGraphs by designing and engineering parallel and distributed algorithms and data structures to optimize performance and cluster utilization,
(ii) Extending the WebGraph framework with parallel compression for MS-BioGraphs as edge-weighted graphs, and
(iii) Analyzing the structural characteristics of MS-BioGraphs by comparing them with other real-world graphs.

Datasets on IEEE DataPort

DOI: https://doi.org/10.21227/gmd9-1534

Validation & Sample Code

Please visit the MS-BioGraphs Validation post.
The ParaGrapher library (https://blogs.qub.ac.uk/DIPSA/ParaGrapher/) may be used to access MS-BioGraphs in C/C++.

Datasets, Source Code, and Publications

Minimum Spanning Forest of MS-BioGraphs, 9 August 2024
MS-BioGraphs on IEEE DataPort, 17 April 2024
ParaGrapher Source Code For WebGraph Types, 16 February 2024
On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets – BigData’23 (Short Paper), 15 December 2023
Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster), 2 October 2023
MS-BioGraphs: Sequence Similarity Graph Datasets, 30 August 2023
MS-BioGraphs MS, 10 August 2023
MS-BioGraphs MSA500, 10 August 2023
MS-BioGraphs MS200, 10 August 2023
MS-BioGraphs MSA200, 10 August 2023
MS-BioGraphs MS50, 10 August 2023
MS-BioGraphs MSA50, 10 August 2023
MS-BioGraphs MSA10, 10 August 2023
MS-BioGraphs MS1, 10 August 2023
MS-BioGraphs Validation, 10 August 2023

Project Members

– Mohsen Koohi Esfahani
– Sebastiano Vigna, Università degli Studi di Milano
– Paolo Boldi, Università degli Studi di Milano
– Hans Vandierendonck
– Peter Kilpatrick

Naming

Each graph name starts with the two characters M and S, the initials of Metaclust (the source dataset) and Sequence similarity (the real-world domain of the graph), respectively. Names of directed subgraphs have a third character, A, which indicates that the graph is asymmetric. Subgraph names end with a number of up to 3 digits that gives the size of the subgraph relative to the MS graph, multiplied by a thousand. For example, MSA500 is a directed subgraph with a relative size of 0.5.
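
The naming convention can be decoded mechanically. The following is a minimal Python sketch, not part of the MS-BioGraphs or ParaGrapher code: the function name `parse_graph_name` and the `GraphName` type are hypothetical, and accepting one to three digits is an assumption based on the published names (MS1, MS50, MSA500). Because the convention only describes the A marker for subgraphs, the sketch reports the direction type of the full MS graph as unknown.

```python
import re
from typing import NamedTuple, Optional


class GraphName(NamedTuple):
    """Decoded parts of an MS-BioGraphs graph name (hypothetical helper type)."""
    asymmetric: Optional[bool]  # None for the full MS graph: its name does not encode this
    relative_size: float        # size relative to the MS graph (1.0 for MS itself)


# "MS", an optional "A" for asymmetric (directed) subgraphs, then an optional
# number of up to three digits (assumed from the published dataset names).
_NAME_PATTERN = re.compile(r"MS(A?)(\d{1,3})?")


def parse_graph_name(name: str) -> GraphName:
    """Split a name such as 'MSA500' into its direction flag and relative size."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an MS-BioGraphs graph name: {name!r}")
    has_a = match.group(1) == "A"
    digits = match.group(2)
    if digits is None:
        if has_a:
            # An 'A' marker without a size suffix is not covered by the convention.
            raise ValueError(f"subgraph name without a size suffix: {name!r}")
        return GraphName(asymmetric=None, relative_size=1.0)
    # The suffix is the relative size multiplied by a thousand.
    return GraphName(asymmetric=has_a, relative_size=int(digits) / 1000)


if __name__ == "__main__":
    for graph in ["MS", "MSA500", "MS200", "MSA200", "MS50", "MSA50", "MSA10", "MS1"]:
        print(graph, parse_graph_name(graph))
```

Running the script prints the decoded direction flag and relative size for the eight published graph names.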

Grants and Funding

– Kelvin-2 supercomputer (UKRI EPSRC grant EP/T022175/1)
– PhD scholarship from The Department for the Economy, Northern Ireland and QUB
– Energy Efficient Transprecision Techniques for Linear system Solvers
– SERICS project (PE00000014) under the NRRP MUR program funded by the EU – NGEU

License

The datasets are published under the CC BY-NC-SA license.
QUB IDF: 2223-052

Last update: Sep. 13th, 2024

Acknowledgements

We are grateful to
– IEEE DataPort
– Sean McKeever, Head of IT, EEECS, QUB and his team
– Ian Overton, School of Medicine, Dentistry and Biomedical Sciences, QUB
– Vaughan Purnell, Head of NI-HPC and his team
– Jesus Martinez-del-Rincon and SPRC committee, EEECS, QUB
– Martin Frith, University of Tokyo
– Ariful Azad, Indiana University
– Eurecom, France
– Unsplash, Pixabay, and Plotly