Progress in High-Performance Computing (HPC) in general, and in High-Performance Graph Processing in particular, depends heavily on the availability of publicly accessible, relevant, and realistic datasets.
To sustain this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available real-world graph datasets with up to 2.5 trillion edges, which is 6.6 times larger than the largest graph published recently.
The largest graph is created by matching (i.e., all-to-all similarity alignment of) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types.
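To give a feel for why all-to-all matching at this scale is an HPC challenge, the short sketch below counts the unordered sequence pairs that a naive all-to-all comparison of 1.7 billion sequences would have to consider, and relates that to the 2.5 trillion edges quoted above. It is a back-of-the-envelope illustration only: the function name `candidate_pairs` is hypothetical, and treating each edge as one unordered pair is an assumption made here, not a statement about how the graphs were built or how their edges are counted.

```python
def candidate_pairs(num_sequences: int) -> int:
    """Number of unordered pairs among num_sequences items: n * (n - 1) / 2."""
    return num_sequences * (num_sequences - 1) // 2


# Figures quoted in the text above.
NUM_SEQUENCES = 1_700_000_000       # 1.7 billion protein sequences
NUM_EDGES = 2_500_000_000_000       # up to 2.5 trillion edges

pairs = candidate_pairs(NUM_SEQUENCES)

# Assumption: one edge corresponds to one unordered pair. If edges are
# counted per direction, the fraction below would be roughly halved.
edge_fraction = NUM_EDGES / pairs

print(f"candidate pairs: {pairs:.3e}")
print(f"fraction of pairs kept as edges (under the assumption): {edge_fraction:.2e}")
```

Under this assumption, only a tiny fraction of all possible pairs ends up as edges, so avoiding the explicit evaluation of every pair is what makes the problem tractable.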
The work has three parts:
(i) Creating and validating the MS-BioGraphs by designing and engineering parallel and distributed algorithms and data structures to optimize performance and cluster utilization,
(ii) Extending the WebGraph framework with parallel compression for MS-BioGraphs as edge-weighted graphs, and
(iii) Analyzing the structural characteristics of MS-BioGraphs by comparing to other real-world graphs.
Datasets, Source Code, and Publications
- MS-BioGraphs: Sequence Similarity Graph Datasets
- Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster)
- MS-BioGraphs MS
- MS-BioGraphs MSA500
- MS-BioGraphs MS200
- MS-BioGraphs MSA200
- MS-BioGraphs MS50
- MS-BioGraphs MSA50
- MS-BioGraphs MSA10
- MS-BioGraphs MS1
- MS-BioGraphs Validation
Each graph name begins with the two characters M and S, the initials of Metaclust (the source dataset) and Sequence similarity (the real-world domain of the graph), respectively. The names of the directed subgraphs have a third character, A, which indicates that the graph is asymmetric. Subgraph names end with a number of up to three digits that gives the size of the subgraph relative to the MS graph, multiplied by a thousand; for example, MS50 is about 5% of the size of MS.
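The naming convention can be decoded mechanically. The following is a minimal sketch, not part of the MS-BioGraphs tooling: the `GraphName` type and `parse_graph_name` function are hypothetical names introduced here, and the sketch assumes that a bare `MS` denotes the full graph (relative size 1.0) and that the numeric suffix has one to three digits, as in the names listed above.

```python
import re
from typing import NamedTuple


class GraphName(NamedTuple):
    name: str
    asymmetric: bool       # True if the name carries the "A" marker
    relative_size: float   # size relative to the full MS graph


# "MS", optional "A", optional numeric suffix of 1 to 3 digits (assumption).
_NAME_RE = re.compile(r"^MS(A?)(\d{1,3})?$")


def parse_graph_name(name: str) -> GraphName:
    """Decode an MS-BioGraphs name such as 'MSA500' or 'MS1'."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an MS-BioGraphs name: {name!r}")
    marker, digits = match.groups()
    if digits is None:
        if marker:
            # "MSA" without a size suffix does not appear in the dataset list.
            raise ValueError(f"missing size suffix: {name!r}")
        # Assumption: the bare name "MS" is the full graph.
        return GraphName(name, False, 1.0)
    # The suffix is the relative size multiplied by a thousand.
    return GraphName(name, marker == "A", int(digits) / 1000)


for graph in ["MS", "MSA500", "MS200", "MSA200", "MS50", "MSA50", "MSA10", "MS1"]:
    print(parse_graph_name(graph))
```

For instance, `parse_graph_name("MSA500")` reports an asymmetric graph with a relative size of 0.5, and `parse_graph_name("MS1")` a graph without the asymmetric marker with a relative size of 0.001.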
Validation & Sample Code
Please visit the MS-BioGraphs Validation post.
Grants and Funding
– Kelvin-2 supercomputer (UKRI EPSRC grant EP/T022175/1)
– PhD scholarship from The Department for the Economy, Northern Ireland and QUB
– QUB EEECS SPRC grant A2565EEC
– SERICS project (PE00000014) under the NRRP MUR program funded by the EU – NGEU
The datasets are published under the CC BY-NC-SA license.
QUB IDF: 2223-052
We are grateful to
– Sean McKeever, Head of IT, EEECS, QUB and his team
– Dr. Ian Overton, School of Medicine, Dentistry and Biomedical Sciences, QUB
– Dr. Vaughan Purnell, Head of NI-HPC and his team
– Dr. Jesus Martinez-del-Rincon and SPRC committee, EEECS, QUB
– Prof. Martin Frith, University of Tokyo
– Dr. Ariful Azad, Indiana University
– Eurecom, France
Last update: Sep. 19th, 2023