Name | MS-BioGraphs – MSA500 |
URL | https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MSA500 |
Download Link | https://doi.org/10.21227/gmd9-1534 |
Script for Downloading All Files | https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/ |
Validation and Sample Code | https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/ |
Graph Explanation | Vertices represent proteins and each edge represents the sequence similarity between its two endpoints |
Edge Weighted | Yes |
Directed | Yes |
Number of Vertices | 1,757,323,526 |
Number of Edges | 1,244,904,754,157 |
Maximum In-Degree | 229,442 |
Maximum Out-Degree | 814,461 |
Minimum Weight | 98 |
Maximum Weight | 634,925 |
Number of Zero In-Degree Vertices | 6,437,984 |
Number of Zero Out-Degree Vertices | 16,843,087 |
Average In-Degree | 711.0 |
Average Out-Degree | 715.3 |
Size of The Largest Weakly Connected Component | 1,244,203,865,823 |
Number of Weakly Connected Components | 148,861,367 |
Creation Details | MS-BioGraphs: Sequence Similarity Graph Datasets |
Format | WebGraph |
License | CC BY-NC-SA |
QUB IDF | 2223-052 |
DOI | 10.5281/zenodo.7820810 |
Citation | Mohsen Koohi Esfahani, Sebastiano Vigna,
Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024,
"MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets",
IEEE Dataport, doi: https://doi.org/10.21227/gmd9-1534. |
Bibtex | @data{gmd9-1534-24,
doi = {10.21227/gmd9-1534},
url = {https://doi.org/10.21227/gmd9-1534},
author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi,
Paolo and Vandierendonck, Hans and Kilpatrick, Peter},
publisher = {IEEE Dataport},
title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets},
year = {2024} } |
Underlying Graph |
The underlying graph in WebGraph format:
- File: MSA500-underlying.graph, Size: 3,755,604,574,487 Bytes
- File: MSA500-underlying.offsets, Size: 4,811,273,232 Bytes
- File: MSA500-underlying.properties, Size: 1,537 Bytes
Total Size: 3,760,415,849,256 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
|
Weights (Labels) |
The weights of the graph in WebGraph format:
- File: MSA500-weights.labels, Size: 2,520,671,185,509 Bytes
- File: MSA500-weights.labeloffsets, Size: 4,554,987,345 Bytes
- File: MSA500-weights.properties, Size: 187 Bytes
Total Size: 2,525,226,173,041 Bytes
These files are validated using the ‘Edge Blocks SHAs File’ described below.
|
Edge Blocks SHAs File (Text) |
This file contains the shasums of the edge blocks. Each block contains
64 million consecutive edges and has two shasums: one for its 64M endpoints and
one for its 64M edge weights.
The file is used to validate the underlying graph and the weights.
For a further explanation of the validation process, see
https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation.
- Name: MSA500_edges_shas.txt
- Size: 2,226,360 Bytes
- SHASUM: d9f692b6f4770f282ea62936293baf6a649c2b91
|
Offsets (Binary) |
The offsets array of the graph in CSX (Compressed Sparse Rows/Columns) format, stored in binary
format and little-endian order. It consists of |V|+1 elements of 8 bytes each.
The first and last values are 0 and |E|, respectively.
This array helps to convert the graph (or parts of it) from WebGraph format
to binary format in one pass over the (related) edges.
- Name: MSA500_offsets.bin
- Size: 14,058,588,216 Bytes
- SHASUM: 3eab31d99426ed9f96af6b258fd1253544ba5461
|
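As an illustration of the layout described for the offsets array, the following sketch reads two neighbouring offsets with a seek instead of loading the whole 14 GB file, and derives the degree of a vertex as their difference. It assumes the 8-byte elements are unsigned integers (the page only states the width and byte order); the helper names are hypothetical.

```python
import struct

ELEMENT_SIZE = 8  # each offset occupies 8 bytes, little endian


def read_offset_pair(f, v):
    """Return (offsets[v], offsets[v + 1]) from a binary file opened in 'rb' mode."""
    f.seek(v * ELEMENT_SIZE)
    data = f.read(2 * ELEMENT_SIZE)
    if len(data) != 2 * ELEMENT_SIZE:
        raise IndexError(f"vertex {v} is out of range")
    # "<2Q": two little-endian unsigned 64-bit integers (unsigned is an assumption).
    return struct.unpack("<2Q", data)


def degree(f, v):
    """Number of edges of vertex v, i.e. offsets[v + 1] - offsets[v]."""
    start, end = read_offset_pair(f, v)
    return end - start


def expected_offsets_file_size(num_vertices):
    """Size in bytes of an offsets file holding |V| + 1 elements."""
    return (num_vertices + 1) * ELEMENT_SIZE
```

With |V| = 1,757,323,526, `expected_offsets_file_size` gives 14,058,588,216 bytes, which agrees with the size listed for `MSA500_offsets.bin`.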
WCC (Binary) |
The Weakly Connected Component (WCC) array in binary format and little-endian order.
This array consists of |V| elements of 4 bytes each.
Vertices in the same component have the same value in the WCC array.
- Name: MSA500-wcc.bin
- Size: 7,029,294,104 Bytes
- SHASUM: 30f12b738dde8f62aecb94239796b169512e6710
|
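Because vertices of one component share a value in the WCC array, component sizes can be obtained by counting how often each value occurs. The sketch below streams the file in chunks; it assumes the 4-byte elements are unsigned integers, and `component_sizes` is an illustrative name.

```python
import struct
from collections import Counter


def component_sizes(f, chunk_elements=1 << 20):
    """Count vertices per component label in a binary WCC array.

    `f` is a binary file object; elements are read as little-endian
    4-byte unsigned integers (unsigned is an assumption).
    """
    sizes = Counter()
    while True:
        data = f.read(4 * chunk_elements)
        if not data:
            break
        if len(data) % 4:
            raise ValueError("file length is not a multiple of 4 bytes")
        sizes.update(value for (value,) in struct.iter_unpack("<I", data))
    return sizes
```

For this graph the counter would hold one entry per component (148,861,367 entries), so a plain Python `Counter` is only practical for subsets or smaller graphs; an array-based tally would be a better fit at full scale.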
Transposed’s Offsets (Binary) |
The offsets array of the transposed graph in binary format and little-endian order.
It consists of |V|+1 elements of 8 bytes each. The first and last values are 0 and |E|, respectively.
It helps to transpose the graph in one pass over the edges.
- Name: MSA500_trans_offsets.bin
- Size: 14,058,588,216 Bytes
- SHASUM: 220a2a5c60baaedc8913720862b535ba6cabb5bd
|
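To show why precomputed transposed offsets allow a single pass, here is a minimal in-memory sketch on plain Python lists. It is not the authors' implementation; the function and variable names are hypothetical, and edge weights are left out for brevity.

```python
def transpose_csr(offsets, neighbours, trans_offsets):
    """Build the neighbour array of the transposed graph in one pass over the edges.

    offsets / neighbours: CSR arrays of the original graph.
    trans_offsets: offsets array of the transposed graph (|V| + 1 entries).
    """
    num_vertices = len(offsets) - 1
    # cursor[d] is the next free slot in the transposed neighbour list of d.
    cursor = list(trans_offsets[:num_vertices])
    trans_neighbours = [None] * offsets[-1]
    for src in range(num_vertices):
        for e in range(offsets[src], offsets[src + 1]):
            dst = neighbours[e]
            trans_neighbours[cursor[dst]] = src
            cursor[dst] += 1
    return trans_neighbours


# Toy graph with edges 0->1, 0->2, 1->2, 2->0.
offsets = [0, 2, 3, 4]
neighbours = [1, 2, 2, 0]
trans_offsets = [0, 1, 2, 4]  # in-degrees 1, 1, 2
print(transpose_csr(offsets, neighbours, trans_offsets))  # [2, 0, 0, 1]
```

Without the transposed offsets, a first pass over all edges would be needed just to count in-degrees before any edge could be placed.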
Names (tar.gz) |
This compressed file contains 120 files in CSV format using ‘;’ as the separator.
Each row has two columns: the vertex ID and the name of the sequence.
Note: If the graph has an ‘N2O Reordering’ file, the n2o array should
be used to convert the vertex ID to the old vertex ID, which identifies
the name of the protein in the `names.tar.gz` file.
- Name: names.tar.gz
- Size: 27,130,045,933 Bytes
- SHASUM: ba00b58bbb2795445554058a681b573c751ef315
|
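A possible way to read the name files directly from the archive is sketched below. It assumes UTF-8 text, no header row, and an integer ID in the first column, none of which is stated on this page. `iter_names` and `name_of` are illustrative names, and the `n2o` argument stands for the reordering array mentioned in the note, however it has been loaded.

```python
import csv
import io
import tarfile


def iter_names(tar):
    """Yield (vertex_id, sequence_name) pairs from every CSV file in an open TarFile."""
    for member in tar:
        if not member.isfile():
            continue
        raw = tar.extractfile(member)
        # Assumption: the CSV files are UTF-8 encoded and have no header row.
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        for row in csv.reader(text, delimiter=";"):
            if len(row) >= 2:
                yield int(row[0]), row[1]


def name_of(vertex_id, names, n2o=None):
    """Look up a name, mapping the vertex ID to the old vertex ID when n2o is given."""
    old_id = n2o[vertex_id] if n2o is not None else vertex_id
    return names[old_id]
```

Typical use would be `with tarfile.open("names.tar.gz", "r:gz") as tar: names = dict(iter_names(tar))`, although at 1.7 billion rows a dictionary will not fit in ordinary memory and a filtered pass over the rows of interest is more realistic.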
OJSON |
The characteristics of the graph and the shasums of the files.
It is in the open JSON format and needs a closing brace (}) to be appended
before being passed to a JSON parser.
- Name: MSA500.ojson
- Size: 902 Bytes
- SHASUM: 5eaebdff2dc56925a0b4751f579ebeabb6e3bee5
|
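The closing-brace step for the OJSON file can be sketched as follows. `parse_ojson` is an illustrative helper, and the keys in the usage comment are made up for the example rather than taken from `MSA500.ojson`.

```python
import json


def parse_ojson(text):
    """Parse 'open JSON' text by appending the missing closing brace."""
    return json.loads(text.rstrip() + "}")


# Usage with a real file:
#   with open("MSA500.ojson", encoding="utf-8") as f:
#       info = parse_ojson(f.read())
#
# Hypothetical content, for illustration only:
#   parse_ojson('{"name": "example", "vertices": 3\n')
```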