Project Statement
Progress in High-Performance Computing in general, and in High-Performance Graph Processing in particular, depends heavily on the availability of publicly accessible, relevant, and realistic datasets.
We (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available real-world graph datasets with up to 2.5 trillion edges, which is 6.6 times larger than the largest graph published recently.
The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs of different sizes and direction types.
Project Steps
(i) Creating and validating the MS-BioGraphs by designing and engineering parallel and distributed algorithms and data structures to optimize performance and cluster utilization,
(ii) Extending the WebGraph framework with parallel compression for MS-BioGraphs as edge-weighted graphs, and
(iii) Analyzing the structural characteristics of MS-BioGraphs by comparing to other real-world graphs.
Datasets on IEEE DataPort
DOI: https://doi.org/10.21227/gmd9-1534
Permanent short link: http://ieee-dataport.org/12652
Validation & Sample Code
Please visit the MS-BioGraphs Validation post.
The ParaGrapher library (https://blogs.qub.ac.uk/DIPSA/ParaGrapher/) may be used to access MS-BioGraphs in C/C++.
Datasets, Source Code, and Publications
- Minimum Spanning Forest of MS-BioGraphs
- MS-BioGraphs on IEEE DataPort
- ParaGrapher Source Code For WebGraph Types
- On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets – BigData’23 (Short Paper)
- Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster)
- MS-BioGraphs: Sequence Similarity Graph Datasets
- MS-BioGraphs MS
- MS-BioGraphs MSA500
- MS-BioGraphs MS200
- MS-BioGraphs MSA200
- MS-BioGraphs MS50
- MS-BioGraphs MSA50
- MS-BioGraphs MSA10
- MS-BioGraphs MS1
- MS-BioGraphs Validation
Project Members
– Mohsen Koohi Esfahani
– Sebastiano Vigna, Università degli Studi di Milano
– Paolo Boldi, Università degli Studi di Milano
– Hans Vandierendonck
– Peter Kilpatrick
Naming
Each graph name starts with the two characters M and S, the initials of Metaclust (the source dataset) and Sequence similarity (the real-world domain of the graph), respectively. The names of the directed subgraphs have a third character, A, which indicates that the graph is asymmetric. Each subgraph name ends with up to three digits that give the size of the subgraph relative to the MS graph, multiplied by a thousand; for example, MSA500 is a directed subgraph with a relative size of 0.5.
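The naming rule can be read mechanically. The following is a minimal Python sketch, not part of the MS-BioGraphs tooling: the function name `parse_graph_name`, the `GraphName` type, and the regular expression are hypothetical and only encode the convention as described here (prefix MS, optional A for asymmetric subgraphs, then one to three digits equal to the relative size times a thousand). It assumes that the bare name MS denotes the full graph with relative size 1.

```python
import re
from typing import NamedTuple


class GraphName(NamedTuple):
    """Decoded form of an MS-BioGraphs name (hypothetical helper type)."""
    asymmetric: bool      # True when the name carries the third character A
    relative_size: float  # size relative to the full MS graph


# Either the bare name "MS", or "MS" + optional "A" + one to three digits.
# This pattern is an assumption derived from the naming description.
_NAME_PATTERN = re.compile(r"MS(?:(A?)(\d{1,3}))?")


def parse_graph_name(name: str) -> GraphName:
    """Decode a graph name such as 'MSA500' into its two components."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an MS-BioGraphs style name: {name!r}")
    digits = match.group(2)
    if digits is None:
        # Bare "MS": assumed to be the full graph.
        return GraphName(asymmetric=False, relative_size=1.0)
    # The digits encode the relative size multiplied by a thousand.
    return GraphName(asymmetric=match.group(1) == "A",
                     relative_size=int(digits) / 1000)


if __name__ == "__main__":
    for graph in ("MS", "MSA500", "MS200", "MSA10", "MS1"):
        print(graph, parse_graph_name(graph))
```

Under this reading, MSA10 is a directed subgraph with a relative size of one hundredth, and MS1 is a subgraph with a relative size of one thousandth.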
Grants and Funding
– Kelvin-2 supercomputer (UKRI EPSRC grant EP/T022175/1)
– PhD scholarship from The Department for the Economy, Northern Ireland and QUB
– Energy Efficient Transprecision Techniques for Linear system Solvers
– SERICS project (PE00000014) under the NRRP MUR program funded by the EU – NGEU
License
The datasets are published under the CC BY-NC-SA license.
QUB IDF: 2223-052
Last update: Sep. 13th, 2024
Acknowledgements
We are grateful to
– IEEE DataPort
– Sean McKeever, Head of IT, EEECS, QUB and his team
– Ian Overton, School of Medicine, Dentistry and Biomedical Sciences, QUB
– Vaughan Purnell, Head of NI-HPC and his team
– Jesus Martinez-del-Rincon and SPRC committee, EEECS, QUB
– Martin Frith, University of Tokyo
– Ariful Azad, Indiana University
– EURECOM, France
– Unsplash, Pixabay, and Plotly