MS-BioGraphs MSA200

Name: MS-BioGraphs – MSA200
URL: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MSA200
Download Link: https://dx.doi.org/10.21227/gmd9-1534
Script for Downloading All Files: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation: Vertices represent proteins, and each edge represents the sequence similarity between its two endpoints.
Edge Weighted: Yes
Directed: Yes
Number of Vertices: 1,757,323,526
Number of Edges: 500,444,322,597
Maximum In-Degree: 658,879
Maximum Out-Degree: 709,176
Minimum Weight: 98
Maximum Weight: 634,925
Number of Zero In-Degree Vertices: 6,437,984
Number of Zero Out-Degree Vertices: 7,471,315
Average In-Degree: 285.8
Average Out-Degree: 286.0
Size of The Largest Weakly Connected Component: 496,880,685,957
Number of Weakly Connected Components: 221,467,156
Creation Details: MS-BioGraphs: Sequence Similarity Graph Datasets
Format: WebGraph
License: CC BY-NC-SA
QUB ID: F2223-052
DOI: 10.5281/zenodo.7820815
Citation
Mohsen Koohi Esfahani, Sebastiano Vigna, 
Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, 
"MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", 
IEEE Dataport, doi: https://dx.doi.org/10.21227/gmd9-1534.
Bibtex
@data{gmd9-1534-24,
doi = {10.21227/gmd9-1534},
url = {https://dx.doi.org/10.21227/gmd9-1534},
author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, 
Paolo and Vandierendonck, Hans and Kilpatrick, Peter},
publisher = {IEEE Dataport},
title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets},
year = {2024} }


Files

Underlying Graph
The underlying graph in WebGraph format:
  • File: MSA200-underlying.graph, Size: 1,558,147,532,780 Bytes
  • File: MSA200-underlying.offsets, Size: 4,319,801,854 Bytes
  • File: MSA200-underlying.properties, Size: 1,517 Bytes
Total Size: 1,562,467,336,151 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)
The weights of the graph in WebGraph format:
  • File: MSA200-weights.labels, Size: 1,105,784,580,128 Bytes
  • File: MSA200-weights.labeloffsets, Size: 4,123,546,304 Bytes
  • File: MSA200-weights.properties, Size: 187 Bytes
Total Size: 1,109,908,126,619 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)
This file contains the shasums of edge blocks. Each block contains 64 million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights.
The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation.
  • Name: MSA200_edges_shas.txt
  • Size: 895,200 Bytes
  • SHASUM: de1ac0ddce536168881ca2e49e6d5f0cf5b82bb5
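The per-file SHASUM values on this page are 40 hexadecimal digits long, which is consistent with SHA-1; the page does not name the algorithm, so treat that as an assumption. Under that assumption, a downloaded file could be checked with a minimal sketch like the one below. The function names `sha1_of_file` and `matches_published_sha` are illustrative only, and this checks a whole file against its listed SHASUM, not the per-block validation described on the validation page.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in fixed-size chunks.

    Assumption: the published SHASUM values are SHA-1 digests (40 hex digits).
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read chunk by chunk so that multi-gigabyte files never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha(path, expected_hex):
    """Compare the computed digest with a SHASUM copied from this page."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_sha("MSA200_edges_shas.txt", "de1ac0ddce536168881ca2e49e6d5f0cf5b82bb5")` would be expected to return `True` for an intact download if the SHA-1 assumption holds.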
Offsets (Binary)
The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements.
The first and last values are 0 and |E|, respectively.
This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges.
  • Name: MSA200_offsets.bin
  • Size: 14,058,588,216 Bytes
  • SHASUM: c241d2dc4bdf46f60c1cd889ac367504d3f58805
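As an illustration of the layout described above, the following sketch reads individual entries of the offsets array without loading the whole 14 GB file, and derives a vertex's degree as the difference of two consecutive offsets. It assumes the 8-byte elements are unsigned integers (the page only states the element size and byte order); the helper names are hypothetical.

```python
import struct

ELEMENT_SIZE = 8  # each offset is an 8-byte little-endian element


def expected_offsets_size(num_vertices):
    """File size in bytes implied by |V|+1 elements of 8 bytes each."""
    return (num_vertices + 1) * ELEMENT_SIZE


def read_offset(f, index):
    """Read offsets[index] from an open binary file object.

    Assumption: elements are unsigned 64-bit integers ("<Q").
    """
    f.seek(index * ELEMENT_SIZE)
    data = f.read(ELEMENT_SIZE)
    if len(data) != ELEMENT_SIZE:
        raise IndexError(f"offset index {index} is outside the file")
    return struct.unpack("<Q", data)[0]


def degree(f, v):
    """Number of edges stored for vertex v: offsets[v + 1] - offsets[v]."""
    return read_offset(f, v + 1) - read_offset(f, v)
```

With |V| = 1,757,323,526, `expected_offsets_size` gives 14,058,588,216 bytes, which matches the listed size of `MSA200_offsets.bin`.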
WCC (Binary)
The Weakly Connected Component (WCC) array in binary format and little-endian order.
This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array.
  • Name: MSA200-wcc.bin
  • Size: 7,029,294,104 Bytes
  • SHASUM: 2cb256d5e49e5dd0989715cb909fd8f27bfbd04c
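A short sketch of how the WCC array might be queried: two vertices belong to the same weakly connected component exactly when their entries are equal. The code assumes the 4-byte elements are unsigned integers, which the page does not state; for an equality test the signedness does not affect the result. The function names are illustrative.

```python
import struct

WCC_ELEMENT_SIZE = 4  # each entry is a 4-byte little-endian element


def expected_wcc_size(num_vertices):
    """File size in bytes implied by |V| elements of 4 bytes each."""
    return num_vertices * WCC_ELEMENT_SIZE


def read_component(f, v):
    """Read the component value of vertex v from an open binary file object.

    Assumption: elements are unsigned 32-bit integers ("<I").
    """
    f.seek(v * WCC_ELEMENT_SIZE)
    data = f.read(WCC_ELEMENT_SIZE)
    if len(data) != WCC_ELEMENT_SIZE:
        raise IndexError(f"vertex {v} is outside the file")
    return struct.unpack("<I", data)[0]


def same_component(f, u, v):
    """True if u and v carry the same value in the WCC array."""
    return read_component(f, u) == read_component(f, v)
```

With |V| = 1,757,323,526, `expected_wcc_size` gives 7,029,294,104 bytes, which matches the listed size of `MSA200-wcc.bin`.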
Transposed’s Offsets (Binary)
The offsets array of the transposed graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively.
It helps transpose the graph in one pass over the edges.
  • Name: MSA200_trans_offsets.bin
  • Size: 14,058,588,216 Bytes
  • SHASUM: 47787ac64fb4485da02e3bcdc1696a814adfdb86
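To illustrate why precomputed transposed offsets allow a one-pass transpose, here is a small in-memory sketch on plain Python lists. It is not the authors' implementation: it assumes the graph is already available as a CSR pair (`offsets`, `neighbors`), and `transpose_csr` is a hypothetical name. Each edge (u, v) is written directly into the next free slot of v's range in the transposed graph, so no separate counting pass is needed.

```python
def transpose_csr(offsets, neighbors, trans_offsets):
    """Transpose a CSR graph in one pass over its edges.

    offsets / neighbors: CSR arrays of the original graph.
    trans_offsets: offsets array of the transposed graph (|V|+1 entries).
    Returns the neighbors array of the transposed graph.
    """
    num_vertices = len(offsets) - 1
    trans_neighbors = [0] * len(neighbors)
    # cursor[v] is the next free position inside v's range of the transposed graph.
    cursor = list(trans_offsets[:-1])
    for u in range(num_vertices):
        for e in range(offsets[u], offsets[u + 1]):
            v = neighbors[e]
            trans_neighbors[cursor[v]] = u
            cursor[v] += 1
    return trans_neighbors


# Tiny example with edges 0->1, 0->2, 1->2, 2->0.
offsets = [0, 2, 3, 4]
neighbors = [1, 2, 2, 0]
trans_offsets = [0, 1, 2, 4]
print(transpose_csr(offsets, neighbors, trans_offsets))  # [2, 0, 0, 1]
```

Edge weights could be moved in the same loop by writing each weight to the same destination position as its edge.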
Names (tar.gz)
This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the vertex ID and the name of the sequence.
Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to its old vertex ID, which identifies the name of the protein in the `names.tar.gz` file.
  • Name: names.tar.gz
  • Size: 27,130,045,933 Bytes
  • SHASUM: ba00b58bbb2795445554058a681b573c751ef315
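The following sketch streams (vertex ID, name) pairs out of the archive without unpacking it to disk. Several details are assumptions, since the page does not specify them: the files are taken to be UTF-8 text, every regular file in the archive is treated as one of the CSV files, and each line is split at the first ‘;’ only, so a name that itself contained ‘;’ would be kept whole. Values are returned as strings, and no header row or N2O conversion is handled. The name `iter_names` is hypothetical.

```python
import io
import tarfile


def iter_names(archive_path):
    """Yield (vertex_id, name) string pairs from every file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = tar.extractfile(member)
            # Assumption: the CSV files are UTF-8 encoded text.
            text = io.TextIOWrapper(raw, encoding="utf-8")
            for line in text:
                # Split at the first ';' only: column 1 is the ID, column 2 the name.
                vertex_id, sep, name = line.rstrip("\r\n").partition(";")
                if sep:
                    yield vertex_id, name
```

A caller could, for instance, stop at the first match with `next(name for vid, name in iter_names("names.tar.gz") if vid == "42")`, where `"42"` is an arbitrary example ID.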
OJSON
The characteristics of the graph and the shasums of the files.
The file is in the open JSON format: a closing brace (}) must be appended before it is passed to a JSON parser.
  • Name: MSA200.ojson
  • Size: 897 Bytes
  • SHASUM: 18e371cbb4bd9dbe6515e4528956ff32fb2e30c4
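The step described above, appending a closing brace before parsing, can be sketched as follows. The function name `load_ojson` is illustrative, UTF-8 encoding is an assumption, and the key names inside the file are not listed on this page, so the example only returns the parsed dictionary.

```python
import json


def load_ojson(path):
    """Parse an open-JSON file by appending the missing closing brace."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # The file ends without its final '}', so add it before parsing.
    return json.loads(text.rstrip() + "}")
```

For example, `load_ojson("MSA200.ojson")` would return a dictionary holding the graph characteristics and file shasums.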


Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

In-Degree Distribution
Out-Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Push and Pull Locality
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

