MS-BioGraphs MS

Name: MS-BioGraphs – MS
Graph Explanation: Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted: Yes
Number of Vertices: 1,757,323,526
Number of Edges: 2,488,069,027,875
Maximum Degree: 814,957
Minimum Weight: 98
Maximum Weight: 634,925
Number of Zero-Degree Vertices: 6,437,984
Average Degree: 1,415.8
Size of the Largest WCC: 2,486,890,448,664
Number of WCC: 148,861,367
Creation Details: MS-BioGraphs: Sequence Similarity Graph Datasets
License: CC BY-NC-SA
QUB ID: F2223-052
Koohi Esfahani, Mohsen, Boldi, Paolo, Vandierendonck, Hans, Kilpatrick, Peter, Vigna, Sebastiano. (2023). MS-BioGraphs - MS.
year = {2023},
author = {Mohsen Koohi Esfahani and Paolo Boldi and 
Hans Vandierendonck and Peter Kilpatrick and 
Sebastiano Vigna},
title = {{MS-BioGraphs - MS}},
doi = {10.5281/zenodo.7820808},
url = {},
howpublished= {\url{}}}

Files and Download Links

Underlying Graph
The underlying graph in WebGraph format:
  • File: MS-underlying.graph, Size: 7,342,853,446,646 Bytes, Download Link
  • File: MS-underlying.offsets, Size: 5,341,385,503 Bytes, Download Link
  • File:, Size: 1,560 Bytes, Download Link
Total Size: 7,348,194,833,709 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)
The weights of the graph in WebGraph format:
  • File: MS-weights.labels, Size: 5,037,171,681,279 Bytes, Download Link
  • File: MS-weights.labeloffsets, Size: 5,070,752,590 Bytes, Download Link
  • File:, Size: 183 Bytes, Download Link
Total Size: 5,042,242,434,052 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)
This file contains the shasums of edge blocks. Each block contains 64 million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights.
The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit the
  • Name: MS_edges_shas.txt
  • Size: 4,449,360 Bytes
  • SHASUM: 85d5b0896f8fa8a2b490ec6560937c45ced8b0d9
  • Download Link
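Before using the SHAs file to validate edge blocks, it is worth checking the file itself against the SHASUM listed above. The following is a minimal sketch, assuming the listed shasums are SHA-1 digests (they are 40 hexadecimal characters long, which is consistent with SHA-1 but is not stated on this page). The function names are illustrative only, and the sketch does not reproduce the per-block validation procedure.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks
    so that very large files do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_shasum(path, expected_hex):
    """Compare a file's SHA-1 digest with an expected hex string.
    Assumption: the published SHASUM is a SHA-1 digest."""
    return sha1_of_file(path) == expected_hex.strip().lower()


# Example call (requires the downloaded file to be present):
# matches_shasum("MS_edges_shas.txt",
#                "85d5b0896f8fa8a2b490ec6560937c45ced8b0d9")
```

The same helper can be applied to the other single files on this page that list a SHASUM.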
Offsets (Binary)
The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements.
The first and last values are 0 and |E|, respectively.
This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges.
  • Name: MS_offsets.bin
  • Size: 14,058,588,216 Bytes
  • SHASUM: 15c3defdbb92f7b1fe48a3fb20530d99fa30c616
  • Download Link
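As an illustration of how the offsets array can be used, here is a minimal sketch that memory-maps the file with NumPy and reads a vertex's degree and edge range. It assumes the 8-byte elements are unsigned integers (the page states only their size and byte order); the function names are hypothetical.

```python
import numpy as np


def load_offsets(path):
    """Memory-map the offsets file without loading it into RAM.
    Assumption: elements are unsigned 64-bit little-endian integers."""
    return np.memmap(path, dtype="<u8", mode="r")


def edge_range(offsets, v):
    """Return (start, end) indices of vertex v's edges in the edge array."""
    return int(offsets[v]), int(offsets[v + 1])


def degree(offsets, v):
    """Degree of vertex v is the difference of consecutive offsets."""
    start, end = edge_range(offsets, v)
    return end - start


# Example calls (require the downloaded file to be present):
# offsets = load_offsets("MS_offsets.bin")
# num_vertices = len(offsets) - 1
# num_edges = int(offsets[-1])
```

The listed file size is consistent with this layout: (1,757,323,526 + 1) × 8 = 14,058,588,216 bytes.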
WCC (Binary)
The Weakly-Connected Component (WCC) array in binary format and little-endian order.
This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array.
  • Name: MS-wcc.bin
  • Size: 7,029,294,104 Bytes
  • SHASUM: 30f12b738dde8f62aecb94239796b169512e6710
  • Download Link
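The WCC array can be queried in a similar way. The sketch below assumes the 4-byte elements are unsigned integers (only the size and byte order are stated on this page) and uses hypothetical function names. It checks whether two vertices share a component and counts the number of vertices per component label.

```python
import numpy as np


def load_wcc(path):
    """Memory-map the WCC file.
    Assumption: elements are unsigned 32-bit little-endian integers."""
    return np.memmap(path, dtype="<u4", mode="r")


def same_component(wcc, u, v):
    """Two vertices are in the same WCC if their array values are equal."""
    return bool(wcc[u] == wcc[v])


def component_sizes(wcc):
    """Map each component label to its number of vertices.
    Note: np.unique sorts a copy of the data, so on the full array
    this needs several gigabytes of memory."""
    labels, counts = np.unique(wcc, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))


# Example calls (require the downloaded file to be present):
# wcc = load_wcc("MS-wcc.bin")
# same_component(wcc, 0, 1)
```

The listed file size matches |V| 4-byte elements: 1,757,323,526 × 4 = 7,029,294,104 bytes.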
Names (tar.gz)
This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence.
Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file.
  • Name: names.tar.gz
  • Size: 27,130,045,933 Bytes
  • SHASUM: ba00b58bbb2795445554058a681b573c751ef315
  • Download Link
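The rows of the names archive can be streamed without unpacking it to disk. This is a minimal sketch under several assumptions that are not stated on this page: the CSV files are UTF-8 text, they have no header row, and every regular file in the archive is one of the CSV files. The function name is hypothetical, and the n2o conversion mentioned in the note above is not shown because the format of that file is not described here.

```python
import csv
import io
import tarfile


def iter_names(archive_path):
    """Yield (vertex_id, sequence_name) string pairs from every
    regular file in a .tar.gz archive of ';'-separated CSV files."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = tar.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8")  # assumed encoding
            for row in csv.reader(text, delimiter=";"):
                if len(row) >= 2:
                    # Rejoin in case a name itself contains the separator.
                    yield row[0], ";".join(row[1:])


# Example call (requires the downloaded file to be present):
# for vertex_id, name in iter_names("names.tar.gz"):
#     print(vertex_id, name)
```

IDs are yielded as strings so that the caller can decide how to convert them before indexing into other arrays.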
OJSON
The characteristics of the graph and the shasums of the files.
It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser.
  • Name: MS.ojson
  • Size: 700 Bytes
  • SHASUM: e2eb3fcdd0c22838971ed2edea8e1ed081a77282
  • Download Link
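Appending the closing brace can be done in memory before parsing. The following is a minimal sketch; the function name is hypothetical and the file is assumed to be UTF-8 text.

```python
import json


def load_ojson(path):
    """Read an open-JSON file, append the missing closing brace,
    and parse the result with the standard JSON parser."""
    with open(path, "r", encoding="utf-8") as f:  # assumed encoding
        text = f.read()
    return json.loads(text.rstrip() + "}")


# Example call (requires the downloaded file to be present):
# info = load_ojson("MS.ojson")
```

The original file is left unchanged, so its SHASUM can still be checked afterwards.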


For an explanation of the plots, please refer to the MS-BioGraphs paper.

Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

