We provide a shell script, validation.sh, and a Java program, EdgeBlockSHA.java, to verify the correctness of the graphs. Each graph has a .ojson file whose shasum is verified against the value retrieved from our server. Other files, such as edges_shas.txt, have shasum records in the .ojson file, and these records are used to validate those files.
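As an illustration of the file-level check described above, the following is a minimal sketch rather than the logic of validation.sh itself. It assumes the shasum is a SHA-1 digest (the 40-hex-digit values in edges_shas.txt suggest this) and that the expected value has already been obtained from the .ojson file or the server. The class name FileShaCheck and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of validation.sh or EdgeBlockSHA.java.
class FileShaCheck {
    // Formats a digest as lowercase hexadecimal, two digits per byte.
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Streams the file through SHA-1 (assumed algorithm) and returns the hex digest.
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest());
    }

    // Compares the computed digest with an expected hex string, ignoring case.
    static boolean matches(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha1Hex(file).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java FileShaCheck <file> <expected-sha1>");
            return;
        }
        System.out.println(matches(Paths.get(args[0]), args[1]) ? "OK" : "MISMATCH");
    }
}
```

The file is read in fixed-size chunks so that large graph files do not need to fit in memory.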
The graph in WebGraph format is stored compressed in the MS??-weights.* files. To validate the compressed graph, EdgeBlockSHA.java is used. It is a parallel Java program that uses the WebGraph library to traverse the graph and calculate the shasum of blocks of edges (endpoints and weights). The calculated results are then matched against the edges_shas.txt file of the graph.
It is also possible to validate only particular blocks by matching the calculated shasum with the relevant row in the edges_shas.txt file. This file has a format such as the following. Each block contains 64 million consecutive edges. The start of each block is identified by a vertex ID and its edge index. The column endpoint_sha is the shasum of the block's 64 million endpoints when stored as an array of 4-byte elements in binary format and little-endian order. Similarly, the column weights_sha is the shasum of the weights (labels). Weights are kept separate from endpoints because some applications do not need weights and therefore do not have to read and validate them.
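The endpoint encoding described above can be sketched as follows. This is an illustrative example, not the code of EdgeBlockSHA.java: it assumes the shasum is SHA-1 and that endpoints are available as a Java int array. The class name BlockSha and the chunk size are hypothetical. The element width of the weights is not stated here, so the sketch covers endpoints only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes endpoints as 4-byte little-endian elements.
class BlockSha {
    // Number of ints serialized per digest update (an arbitrary choice).
    static final int CHUNK = 1 << 20;

    static String sha1OfLittleEndianInts(int[] values) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // assumed algorithm
        ByteBuffer buf = ByteBuffer.allocate(CHUNK * Integer.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (int start = 0; start < values.length; start += CHUNK) {
            buf.clear();
            int end = Math.min(values.length, start + CHUNK);
            for (int i = start; i < end; i++) {
                buf.putInt(values[i]); // written least significant byte first
            }
            md.update(buf.array(), 0, buf.position());
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        int[] endpoints = {1, 2, 3}; // toy data, not taken from a real graph
        System.out.println(sha1OfLittleEndianInts(endpoints));
    }
}
```

Serializing in chunks keeps the temporary buffer small even for a full block, whose 64 million endpoints occupy roughly 256 MB as 4-byte elements.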
64MB blk#; vertex; edge index; endpoint_sha; weights_sha;
0; 0; 0; 509784b158cb9404241afb21d0ceaf590b88d2f2; 57da4ad7bb89c5922e436b0535d791fa8f40dffd;
1; 2315113; 705; fafc118563c1d7b5fbff64af56edd6a56524f479; 13b7a9ca60bfb0715d563218d0a1cd787b00a07c;
2; 4521625; 597; 4ed65aa07c8062a151166ef2e9bdb93e41d19357; 8158276bec426ee46eca9912759eb9bd57fcc957;
3; 6347361; 112; d02e8913c807c3f4ecde9c638e0ded5ab80ba819; 26bc3296de65cba6ac539cd96b79ae6f7a4d37be;
4; 8447869; 15; 61513c84db40124496cdf769516118b63598914f; 781b9f4372ac614e94d097017c756d015234deb6;
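To look up the expected shasums of a particular block, the rows shown above can be parsed as in the following sketch. It assumes that edges_shas.txt holds one semicolon-separated record per line with a single header line, which is inferred from the sample rather than confirmed. The class EdgeShaTable and its BlockRow type are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for rows in the format shown in the sample.
class EdgeShaTable {
    // One record: block number, starting vertex, edge index, and the two digests.
    static final class BlockRow {
        final long block, vertex, edgeIndex;
        final String endpointSha, weightsSha;

        BlockRow(long block, long vertex, long edgeIndex,
                 String endpointSha, String weightsSha) {
            this.block = block;
            this.vertex = vertex;
            this.edgeIndex = edgeIndex;
            this.endpointSha = endpointSha;
            this.weightsSha = weightsSha;
        }
    }

    static List<BlockRow> parse(List<String> lines) {
        List<BlockRow> rows = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(";"); // trailing empty field is dropped
            if (f.length < 5) continue;   // skip blank or malformed lines
            String first = f[0].trim();
            // The header starts with "64MB blk#", so require digits only.
            if (!first.matches("\\d+")) continue;
            rows.add(new BlockRow(Long.parseLong(first),
                                  Long.parseLong(f[1].trim()),
                                  Long.parseLong(f[2].trim()),
                                  f[3].trim(), f[4].trim()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "64MB blk#; vertex; edge index; endpoint_sha; weights_sha;",
            "1; 2315113; 705; fafc118563c1d7b5fbff64af56edd6a56524f479; "
                + "13b7a9ca60bfb0715d563218d0a1cd787b00a07c;");
        for (BlockRow r : parse(sample)) {
            System.out.println("block " + r.block + " starts at vertex " + r.vertex
                + ", edge index " + r.edgeIndex);
        }
    }
}
```

A digest computed for a block can then be compared with the endpointSha or weightsSha field of the row with the same block number.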
JDK with version > 15
For the WebGraph library, please visit https://webgraph.di.unimi.it.
Licensed under the GNU General Public License v3, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty, express or implied, and without even the implied warranty of merchantability or fitness for a particular purpose.
- MS-BioGraphs: Sequence Similarity Graph Datasets
- Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster)
- MS-BioGraphs MS
- MS-BioGraphs MSA500
- MS-BioGraphs MS200
- MS-BioGraphs MSA200
- MS-BioGraphs MS50
- MS-BioGraphs MSA50
- MS-BioGraphs MSA10
- MS-BioGraphs MS1
- MS-BioGraphs Validation