MS-BioGraphs MSA500 – DIPSA: Data-Intensive Parallel Systems and Algorithms

Name	MS-BioGraphs – MSA500
URL	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MSA500
Download Link	https://doi.org/10.21227/gmd9-1534
Script for Downloading All Files	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation	Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted	Yes
Directed	Yes
Number of Vertices	1,757,323,526
Number of Edges	1,244,904,754,157
Maximum In-Degree	229,442
Maximum Out-Degree	814,461
Minimum Weight	98
Maximum Weight	634,925
Number of Zero In-Degree Vertices	6,437,984
Number of Zero Out-Degree Vertices	16,843,087
Average In-Degree	711.0
Average Out-Degree	715.3
Size of The Largest Weakly Connected Component	1,244,203,865,823
Number of Weakly Connected Components	148,861,367
Creation Details	MS-BioGraphs: Sequence Similarity Graph Datasets
Format	WebGraph
License	CC BY-NC-SA
QUB IDF	2223-052
DOI	10.5281/zenodo.7820810
Citation	Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://doi.org/10.21227/gmd9-1534.
Bibtex	@data{gmd9-1534-24, doi = {10.21227/gmd9-1534}, url = {https://doi.org/10.21227/gmd9-1534}, author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter}, publisher = {IEEE Dataport}, title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets}, year = {2024} }

Files

Underlying Graph	The underlying graph in WebGraph format. File: MSA500-underlying.graph, Size: 3,755,604,574,487 Bytes; File: MSA500-underlying.offsets, Size: 4,811,273,232 Bytes; File: MSA500-underlying.properties, Size: 1,537 Bytes. Total Size: 3,760,415,849,256 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)	The weights of the graph in WebGraph format. File: MSA500-weights.labels, Size: 2,520,671,185,509 Bytes; File: MSA500-weights.labeloffsets, Size: 4,554,987,345 Bytes; File: MSA500-weights.properties, Size: 187 Bytes. Total Size: 2,525,226,173,041 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)	This file contains the shasums of edge blocks, where each block contains 64 million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights. The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation. Name: MSA500_edges_shas.txt Size: 2,226,360 Bytes SHASUM: d9f692b6f4770f282ea62936293baf6a649c2b91
Offsets (Binary)	The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the relevant edges. Name: MSA500_offsets.bin Size: 14,058,588,216 Bytes SHASUM: 3eab31d99426ed9f96af6b258fd1253544ba5461
WCC (Binary)	The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array. Name: MSA500-wcc.bin Size: 7,029,294,104 Bytes SHASUM: 30f12b738dde8f62aecb94239796b169512e6710
Transposed’s Offsets (Binary)	The offsets array of the transposed graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. It helps transpose the graph in one pass over the edges. Name: MSA500_trans_offsets.bin Size: 14,058,588,216 Bytes SHASUM: 220a2a5c60baaedc8913720862b535ba6cabb5bd
Names (tar.gz)	This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence. Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert a vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file. Name: names.tar.gz Size: 27,130,045,933 Bytes SHASUM: ba00b58bbb2795445554058a681b573c751ef315
OJSON	The characteristics of the graph and the shasums of the files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser. Name: MSA500.ojson Size: 902 Bytes SHASUM: 5eaebdff2dc56925a0b4751f579ebeabb6e3bee5
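
The binary and OJSON files described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling (see the Validating and Sample Code link for that). It makes several assumptions that the table does not state: the 40-hex-digit SHASUM values are taken to be SHA-1 digests, the 8-byte and 4-byte elements are read as unsigned integers, and the helper names (`sha1_of_file`, `load_offsets`, `degree`, `load_wcc`, `component_sizes`, `load_ojson`) are hypothetical.

```python
import hashlib
import json

import numpy as np


def sha1_of_file(path, chunk_size=1 << 24):
    """Hex SHA-1 digest of a file, read in chunks so large files need little memory.

    Assumption: the SHASUM values in the table are SHA-1 (40 hex digits).
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_offsets(path):
    """Memory-map an offsets file: |V|+1 little-endian 8-byte elements.

    Assumption: elements are unsigned ("<u8"); the table only gives their size.
    """
    return np.memmap(path, dtype="<u8", mode="r")


def degree(offsets, v):
    """Number of edges stored for vertex v: offsets[v+1] - offsets[v]."""
    return int(offsets[v + 1]) - int(offsets[v])


def load_wcc(path):
    """Memory-map the WCC array: |V| little-endian 4-byte elements (assumed unsigned)."""
    return np.memmap(path, dtype="<u4", mode="r")


def component_sizes(wcc):
    """Map each component label to its number of vertices.

    np.unique sorts a copy of the array, so this needs memory proportional to |V|.
    """
    labels, counts = np.unique(wcc, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))


def load_ojson(path):
    """Parse an open-JSON file by appending the missing closing brace."""
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.read() + "}")
```

For example, after downloading the files, `sha1_of_file("MSA500_offsets.bin")` could be compared with the SHASUM listed in the table, and `len(load_offsets("MSA500_offsets.bin")) - 1` should equal the number of vertices. Memory-mapping is used here because the offsets array alone is about 14 GB; only the pages that are actually accessed get read from disk.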

Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

In-Degree Distribution
Out-Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Push and Pull Locality
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution