MS-BioGraphs: Trillion-Scale Public Real-World Sequence Similarity Graph Datasets

Project Statement
Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly accessible, relevant, and realistic datasets.
We (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available real-world graph datasets with up to 2.5 trillion edges, which is 6.6 times larger than the largest graph published recently.
The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types.

Project Steps
(i) Creating and validating the MS-BioGraphs by designing and engineering parallel and distributed algorithms and data structures to optimize performance and cluster utilization,
(ii) Extending the WebGraph framework with parallel compression for MS-BioGraphs as edge-weighted graphs, and
(iii) Analyzing the structural characteristics of MS-BioGraphs by comparing them to other real-world graphs.

Datasets on IEEE DataPort
DOI: https://doi.org/10.21227/gmd9-1534
Permanent short link: http://ieee-dataport.org/12652

Validation & Sample Code
Please visit the MS-BioGraphs Validation post.

Datasets, Source Code, and Publications



Project Members
Mohsen Koohi Esfahani
Sebastiano Vigna, Università degli Studi di Milano
Paolo Boldi, Università degli Studi di Milano
Hans Vandierendonck
Peter Kilpatrick

Naming
The name of each graph starts with the two characters M and S, the initials of Metaclust (the source dataset) and Sequence similarity (the real-world domain of the graph), respectively. The names of the directed subgraphs have a third character, A, which indicates that the graph is asymmetric. The name of each subgraph ends with 3 digits that give the size of the subgraph relative to the MS graph, multiplied by a thousand.
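
The naming scheme above can be decoded mechanically. The following is a minimal sketch, not part of the project's code: the function parse_graph_name and the GraphName type are hypothetical helpers, the example names are illustrative, and accepting fewer than three digits (leading zeros omitted) is an assumption made for leniency.

```python
import re
from typing import NamedTuple


class GraphName(NamedTuple):
    """Hypothetical container for the parts encoded in a graph name."""
    directed: bool        # True when the name carries the "A" (asymmetric) marker
    relative_size: float  # size relative to the full MS graph (1.0 for MS itself)


# "MS", an optional "A", then optional size digits (relative size x 1000).
# Assumption: 1 to 3 digits are accepted, in case leading zeros are omitted.
_NAME_PATTERN = re.compile(r"MS(A?)(\d{1,3})?")


def parse_graph_name(name: str) -> GraphName:
    """Decode a graph name according to the naming rule described above."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized graph name: {name!r}")
    asymmetric, digits = match.groups()
    # No digits means the name refers to the full graph, i.e. relative size 1.
    relative_size = int(digits) / 1000 if digits else 1.0
    return GraphName(directed=bool(asymmetric), relative_size=relative_size)
```

Under this sketch, parse_graph_name("MSA500") describes a directed subgraph with about half the size of the MS graph, and parse_graph_name("MS200") describes an undirected subgraph with about one fifth of it.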

Grants and Funding
– Kelvin-2 supercomputer (UKRI EPSRC grant EP/T022175/1)
– PhD scholarship from The Department for the Economy, Northern Ireland and QUB
– Energy Efficient Transprecision Techniques for Linear system Solvers
– SERICS project (PE00000014) under the NRRP MUR program funded by the EU – NGEU

License
The datasets are published under the CC BY-NC-SA license.
QUB IDF: 2223-052

Last update: April 17th, 2024

Acknowledgements
We are grateful to
– IEEE DataPort
– Sean McKeever, Head of IT, EEECS, QUB and his team
– Ian Overton, School of Medicine, Dentistry and Biomedical Sciences, QUB
– Vaughan Purnell, Head of NI-HPC and his team
– Jesus Martinez-del-Rincon and SPRC committee, EEECS, QUB
– Martin Frith, University of Tokyo
– Ariful Azad, Indiana University
– Eurecom, France
– Unsplash, Pixabay, and Plotly