MS-BioGraphs MS50

Name: MS-BioGraphs – MS50
URL: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MS50
Download Link: https://doi.org/10.21227/gmd9-1534
Script for Downloading All Files: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation: Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted: Yes
Directed: No
Number of Vertices: 585,603,088
Number of Edges: 124,783,559,600
Maximum Degree: 507,826
Minimum Weight: 900
Maximum Weight: 634,925
Number of Zero-Degree Vertices: 0
Average Degree: 213.1
Size of The Largest WCC: 102,256,631,195
Weight of Minimum Spanning Forest (ignoring self-edges): 416,318,200,808
Number of WCC: 155,295,301
Creation Details: MS-BioGraphs: Sequence Similarity Graph Datasets
Format: WebGraph
License: CC BY-NC-SA
QUB ID: F2223-052
DOI: 10.5281/zenodo.7820819
Citation
Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://doi.org/10.21227/gmd9-1534.
Bibtex
@data{gmd9-1534-24,
  doi = {10.21227/gmd9-1534},
  url = {https://doi.org/10.21227/gmd9-1534},
  author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter},
  publisher = {IEEE Dataport},
  title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets},
  year = {2024}
}


Files

Underlying Graph
The underlying graph in WebGraph format:
  • File: MS50-underlying.graph, Size: 347,621,279,586 Bytes
  • File: MS50-underlying.offsets, Size: 1,235,232,971 Bytes
  • File: MS50-underlying.properties, Size: 1,459 Bytes
Total Size: 348,856,514,016 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)
The weights of the graph in WebGraph format:
  • File: MS50-weights.labels, Size: 324,269,690,037 Bytes
  • File: MS50-weights.labeloffsets, Size: 1,221,399,047 Bytes
  • File: MS50-weights.properties, Size: 185 Bytes
Total Size: 325,491,089,269 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)
This file contains the shasums of edge blocks, where each block contains 64 million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights.
The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation.
  • Name: MS50_edges_shas.txt
  • Size: 223,440 Bytes
  • SHASUM: 5d1bc449124448e9a6ed3bd439942e31f55d9f97
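
Before using the SHAs file for validation, the downloaded file itself can be checked against the SHASUM listed above. The following is a minimal sketch, assuming the listed 40-hex-digit values are SHA-1 digests (the page does not name the algorithm); the function names are illustrative and not part of the MS-BioGraphs tooling. It only checks a whole file and does not reproduce the per-block validation described on the validation page.

```python
import hashlib

# SHASUM listed on this page for MS50_edges_shas.txt.
LISTED_SHASUM = "5d1bc449124448e9a6ed3bd439942e31f55d9f97"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_shasum(path, expected=LISTED_SHASUM):
    """True if the file's SHA-1 digest equals the expected hex string."""
    # Assumption: the listed SHASUM is a SHA-1 digest of the whole file.
    return sha1_of_file(path) == expected.lower()
```

For example, `matches_listed_shasum("MS50_edges_shas.txt")` would be expected to return True for an intact download, if the SHA-1 assumption holds.
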
Offsets (Binary)
The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements.
The first and last values are 0 and |E|, respectively.
This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges.
  • Name: MS50_offsets.bin
  • Size: 4,684,824,712 Bytes
  • SHASUM: b298f974167a1c64a8ba8e211a970c5b5d427137
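
The layout described above (|V|+1 little-endian 8-byte values, starting at 0 and ending at |E|) can be read directly with NumPy. This is a sketch under stated assumptions: the elements are treated as unsigned 64-bit integers (the page gives only the width), and the function names are illustrative. The expected size (|V|+1) × 8 = 585,603,089 × 8 = 4,684,824,712 bytes agrees with the file size listed above.

```python
import numpy as np

NUM_VERTICES = 585_603_088      # |V| from the table at the top of this page
NUM_EDGES = 124_783_559_600     # |E| from the table at the top of this page


def expected_offsets_size(num_vertices):
    """Size in bytes of an offsets file with |V|+1 8-byte elements."""
    return (num_vertices + 1) * 8


def load_offsets(path, num_vertices):
    """Memory-map the offsets array without loading it all into RAM."""
    # Assumption: elements are unsigned; "<u8" means little-endian uint64.
    offsets = np.memmap(path, dtype="<u8", mode="r")
    if offsets.shape[0] != num_vertices + 1:
        raise ValueError("offsets file does not contain |V|+1 elements")
    return offsets


def offsets_are_consistent(offsets, num_edges):
    """Check the stated property: first value is 0, last value is |E|."""
    return int(offsets[0]) == 0 and int(offsets[-1]) == num_edges


def degree(offsets, v):
    """Number of edges stored for vertex v in the CSX layout."""
    return int(offsets[v + 1]) - int(offsets[v])
```

With the real file, `load_offsets("MS50_offsets.bin", NUM_VERTICES)` followed by `offsets_are_consistent(offsets, NUM_EDGES)` would perform the same check on the published array.
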
WCC (Binary)
The Weakly-Connected Component (WCC) array in binary format and little-endian order.
This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array.
  • Name: MS50-wcc.bin
  • Size: 2,342,412,352 Bytes
  • SHASUM: 4d640ce445477191a3bc3dd00f09f712b9429af2
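
As an illustration of how the WCC array can be used, the sketch below memory-maps the file and groups vertices by their component value. It assumes the 4-byte elements are unsigned integers, and the function names are illustrative. Note that `np.unique` sorts a copy of the array, so on the full file it needs several gigabytes of memory.

```python
import numpy as np


def load_wcc(path, num_vertices):
    """Memory-map the WCC array: one little-endian 4-byte value per vertex."""
    # Assumption: elements are unsigned; "<u4" means little-endian uint32.
    wcc = np.memmap(path, dtype="<u4", mode="r")
    if wcc.shape[0] != num_vertices:
        raise ValueError("WCC file does not contain |V| elements")
    return wcc


def same_component(wcc, u, v):
    """Two vertices are in the same WCC iff their array values are equal."""
    return int(wcc[u]) == int(wcc[v])


def component_vertex_counts(wcc):
    """Map each component value to the number of vertices carrying it."""
    labels, counts = np.unique(wcc, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))
```

The number of keys returned by `component_vertex_counts` is the number of components present in the array.
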
Names (tar.gz)
This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence.
Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert the vertex ID to the old vertex ID, which is used to identify the name of the protein in the names.tar.gz file.
  • Name: names.tar.gz
  • Size: 27,130,045,933 Bytes
  • SHASUM: ba00b58bbb2795445554058a681b573c751ef315
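
The two-column, ‘;’-separated rows can be streamed out of the archive without unpacking it to disk. The following is a rough sketch, assuming the CSV files are UTF-8 text and that vertex IDs are written as decimal integers; rows whose first column is not a plain integer (for example a possible header row) are skipped. The function name is illustrative.

```python
import csv
import io
import tarfile


def iter_names(tar_path):
    """Yield (vertex_id, sequence_name) pairs from every file in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            raw = tar.extractfile(member)
            # Assumption: the CSV files are UTF-8 encoded.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text, delimiter=";"):
                # Skip rows that do not start with a decimal vertex ID.
                if len(row) >= 2 and row[0].strip().isdigit():
                    yield int(row[0]), row[1]
```

Because the archive is about 27 GB, building a full in-memory dictionary from `iter_names` may not be practical; filtering for the IDs of interest while iterating is one option.
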
N2O Reordering (Binary)
The New to Old (N2O) reordering array of the graph in binary format and little-endian order.
It consists of |V| 4-byte elements and identifies the old ID of each vertex, which is used to look up the name of the vertex (protein) in the names.tar.gz file.
  • Name: MS50-n2o.bin
  • Size: 2,342,412,352 Bytes
  • SHASUM: 91939605bdde3eb67a013f80d4c2a84d1684ca8f
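
The lookup described above (graph vertex ID, then old ID via n2o, then name) can be sketched as follows. The 4-byte elements are assumed to be unsigned, and `names_by_old_id` is a hypothetical dictionary built beforehand from the CSV files in names.tar.gz; neither it nor the function names come from the MS-BioGraphs tooling.

```python
import numpy as np


def load_n2o(path, num_vertices):
    """Memory-map the N2O array: one little-endian 4-byte old ID per vertex."""
    # Assumption: elements are unsigned; "<u4" means little-endian uint32.
    n2o = np.memmap(path, dtype="<u4", mode="r")
    if n2o.shape[0] != num_vertices:
        raise ValueError("N2O file does not contain |V| elements")
    return n2o


def old_id(n2o, new_id):
    """Old vertex ID for a vertex ID as used in the graph files."""
    return int(n2o[new_id])


def name_of(new_id, n2o, names_by_old_id):
    """Resolve a graph vertex ID to its sequence name via the old ID."""
    # names_by_old_id is a hypothetical {old_id: name} mapping.
    return names_by_old_id[old_id(n2o, new_id)]
```
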
OJSON
The characteristics of the graph and the shasums of the files.
It is in the open JSON format: a closing brace (}) must be appended before the file is passed to a JSON parser.
  • Name: MS50.ojson
  • Size: 751 Bytes
  • SHASUM: eb94812bea81cd40a3f33d6aaa5fdd63946ffc92
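
Appending the closing brace can be done in memory, without modifying the downloaded file. This is a minimal sketch, assuming the file is UTF-8 text and that the missing brace is the only thing separating it from valid JSON, as stated above; the function name is illustrative.

```python
import json


def load_ojson(path):
    """Parse an open-JSON file by appending the missing closing brace."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Assumption: the file is valid JSON apart from the final "}".
    return json.loads(text.rstrip() + "}")
```

For example, `load_ojson("MS50.ojson")` would return a dictionary with the graph characteristics and file shasums.
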


Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

  • Degree Distribution
  • Weight Distribution
  • Vertex-Relative Weight Distribution
  • Degree Decomposition
  • Cell-Binned Average Weight Degree Distribution
  • Weakly-Connected Components Size Distribution

