Repository
https://github.com/DIPSA-QUB/MS-BioGraphs-Validation
Explanation
We provide a shell script, validation.sh, and a Java program, EdgeBlockSHA.java, to verify the correctness of the graphs. Each graph has a .ojson file whose shasum is verified against the value retrieved from our server. Files such as offsets.bin, wcc.bin, n2o.bin, trans_offsets.bin, and edges_shas.txt have shasum records in the .ojson file, which are used to validate these files.
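As a rough illustration of this kind of check (not the logic of validation.sh itself), the sketch below computes the digest of a file and compares it with an expected value. It assumes SHA-1, which is the default of the shasum utility and matches the 40-hex-digit values shown in edges_shas.txt. The function names are hypothetical, and how the expected value is read from the .ojson file is deliberately left out because the layout of that file is not described here.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha1()  # assumption: SHA-1, the default algorithm of `shasum`
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected_hex):
    """Compare a file's digest with an expected value, ignoring case and surrounding whitespace."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

A caller would pass the path of a file such as offsets.bin together with the digest recorded for it.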
The graph in WebGraph format is stored compressed in the MS??-underlying.* and MS??-weights.* files. To validate the compressed graph, EdgeBlockSHA.java is used. It is a parallel Java program that uses the WebGraph library to traverse the graph and calculate the shasum of blocks of edges (endpoints and weights). The calculated results are then matched against the edges_shas.txt file of the graph.
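The per-block hashing can be sketched as follows. This is a simplified, sequential illustration and not the code of EdgeBlockSHA.java. It assumes SHA-1, assumes that endpoints are packed as unsigned 4-byte little-endian integers (as stated for edges_shas.txt), and assumes that weights use the same layout, which the text only implies. The function names and the block_size parameter are hypothetical.

```python
import hashlib
import struct


def sha1_of_uint32_block(values):
    """Hex SHA-1 of values packed as consecutive unsigned 4-byte little-endian integers."""
    packed = struct.pack("<%dI" % len(values), *values)
    return hashlib.sha1(packed).hexdigest()


def block_digests(edges, block_size):
    """Yield (block_number, endpoint_sha, weights_sha) for consecutive blocks of edges.

    `edges` is an iterable of (endpoint, weight) pairs in traversal order.
    Assumption: weights are unsigned 4-byte integers, like the endpoints.
    """
    endpoints, weights, block = [], [], 0
    for endpoint, weight in edges:
        endpoints.append(endpoint)
        weights.append(weight)
        if len(endpoints) == block_size:
            yield block, sha1_of_uint32_block(endpoints), sha1_of_uint32_block(weights)
            endpoints, weights = [], []
            block += 1
    if endpoints:  # the last block may hold fewer than block_size edges
        yield block, sha1_of_uint32_block(endpoints), sha1_of_uint32_block(weights)
```

Endpoints and weights are hashed separately so that a reader who only needs the graph structure can skip the weights.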
It is also possible to validate particular blocks by matching the calculated shasum with the relevant row in the edges_shas.txt file. This file has a format such as the following. Each block contains 64 million consecutive edges. The start of each block is identified by a vertex ID and its edge index. The column endpoint_sha is the shasum of the 64 million endpoints when stored as an array of 4-byte elements in binary format and in little-endian order. Similarly, the column weights_sha shows the shasum of the weights (labels). We have separated weights from endpoints because some applications do not need weights, in which case it is not necessary to read and validate them.
64MB blk#; vertex; edge index; endpoint_sha; weights_sha;
0; 0; 0; 509784b158cb9404241afb21d0ceaf590b88d2f2; 57da4ad7bb89c5922e436b0535d791fa8f40dffd;
1; 2315113; 705; fafc118563c1d7b5fbff64af56edd6a56524f479; 13b7a9ca60bfb0715d563218d0a1cd787b00a07c;
2; 4521625; 597; 4ed65aa07c8062a151166ef2e9bdb93e41d19357; 8158276bec426ee46eca9912759eb9bd57fcc957;
3; 6347361; 112; d02e8913c807c3f4ecde9c638e0ded5ab80ba819; 26bc3296de65cba6ac539cd96b79ae6f7a4d37be;
4; 8447869; 15; 61513c84db40124496cdf769516118b63598914f; 781b9f4372ac614e94d097017c756d015234deb6;
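To check a single block against its row, the records first have to be read. The sketch below is one possible way to do that, assuming one record per line with semicolon-separated fields and a trailing semicolon. BlockRecord, parse_edges_shas and block_matches are hypothetical names and are not part of the provided scripts.

```python
from typing import NamedTuple


class BlockRecord(NamedTuple):
    vertex: int        # vertex ID at which the block starts
    edge_index: int    # edge index of the block start, as listed in the file
    endpoint_sha: str
    weights_sha: str


def parse_edges_shas(text):
    """Parse edges_shas.txt content into a dict mapping block number to BlockRecord."""
    records = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(";")]
        fields = [f for f in fields if f]  # drop the empty field after a trailing ';'
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # header row, blank line, or anything that is not a record
        block, vertex, edge_index = (int(x) for x in fields[:3])
        records[block] = BlockRecord(vertex, edge_index, fields[3].lower(), fields[4].lower())
    return records


def block_matches(records, block, endpoint_sha, weights_sha=None):
    """Check calculated digests against a row; weights are skipped when weights_sha is None."""
    rec = records[block]
    if endpoint_sha.lower() != rec.endpoint_sha:
        return False
    return weights_sha is None or weights_sha.lower() == rec.weights_sha
```

Leaving weights_sha unset mirrors the case where an application reads only the endpoints and therefore validates only the endpoint_sha column.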
Requirements
- JDK with version > 15
- jq
- wget
WebGraph Framework
Please visit https://webgraph.di.unimi.it .
ParaGrapher Graph Loading API and Library
The WebGraph formats can also be read using the ParaGrapher library: https://blogs.qub.ac.uk/DIPSA/ParaGrapher/.
License
Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied.
Copyright 2022-2023 The Queen’s University of Belfast, Northern Ireland, UK