MS-BioGraphs MSA200

Posted on 10 August 2023 by Mohsen

Name	MS-BioGraphs – MSA200
URL	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MSA200
Download Link	https://dx.doi.org/10.21227/gmd9-1534
Script for Downloading All Files	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation	Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted	Yes
Directed	Yes
Number of Vertices	1,757,323,526
Number of Edges	500,444,322,597
Maximum In-Degree	658,879
Maximum Out-Degree	709,176
Minimum Weight	98
Maximum Weight	634,925
Number of Zero In-Degree Vertices	6,437,984
Number of Zero Out-Degree Vertices	7,471,315
Average In-Degree	285.8
Average Out-Degree	286.0
Size of The Largest Weakly Connected Component	496,880,685,957
Number of Weakly Connected Components	221,467,156
Creation Details	MS-BioGraphs: Sequence Similarity Graph Datasets
Format	WebGraph
License	CC BY-NC-SA
QUB IDF	2223-052
DOI	10.5281/zenodo.7820815
Citation	Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://dx.doi.org/10.21227/gmd9-1534.
Bibtex	@data{gmd9-1534-24, doi = {10.21227/gmd9-1534}, url = {https://dx.doi.org/10.21227/gmd9-1534}, author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter}, publisher = {IEEE Dataport}, title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets}, year = {2024} }
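
As a sanity check on the table above, the reported averages can be reproduced from the other rows. The sketch below assumes (this is inferred from the arithmetic, not stated on this page) that each average is taken over the vertices with non-zero degree in the given direction; dividing by all vertices would instead give about 284.8.

```python
# Values copied from the MSA200 table above.
NUM_VERTICES = 1_757_323_526
NUM_EDGES = 500_444_322_597
ZERO_IN_DEGREE = 6_437_984
ZERO_OUT_DEGREE = 7_471_315


def average_degree(num_edges, num_vertices, num_zero_degree):
    """Average degree over vertices with non-zero degree (assumed convention)."""
    return num_edges / (num_vertices - num_zero_degree)


avg_in = average_degree(NUM_EDGES, NUM_VERTICES, ZERO_IN_DEGREE)
avg_out = average_degree(NUM_EDGES, NUM_VERTICES, ZERO_OUT_DEGREE)
print(f"{avg_in:.1f} {avg_out:.1f}")  # prints "285.8 286.0"
```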

Files

Underlying Graph	The underlying graph in WebGraph format. File: MSA200-underlying.graph, Size: 1,558,147,532,780 Bytes; File: MSA200-underlying.offsets, Size: 4,319,801,854 Bytes; File: MSA200-underlying.properties, Size: 1,517 Bytes; Total Size: 1,562,467,336,151 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)	The weights of the graph in WebGraph format. File: MSA200-weights.labels, Size: 1,105,784,580,128 Bytes; File: MSA200-weights.labeloffsets, Size: 4,123,546,304 Bytes; File: MSA200-weights.properties, Size: 187 Bytes; Total Size: 1,109,908,126,619 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)	This file contains the shasums of edge blocks, where each block contains 64 Million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights. The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation. Name: MSA200_edges_shas.txt; Size: 895,200 Bytes; SHASUM: de1ac0ddce536168881ca2e49e6d5f0cf5b82bb5
Offsets (Binary)	The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges. Name: MSA200_offsets.bin; Size: 14,058,588,216 Bytes; SHASUM: c241d2dc4bdf46f60c1cd889ac367504d3f58805
WCC (Binary)	The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array. Name: MSA200-wcc.bin; Size: 7,029,294,104 Bytes; SHASUM: 2cb256d5e49e5dd0989715cb909fd8f27bfbd04c
Transposed’s Offsets (Binary)	The offsets array of the transposed graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. It helps to transpose the graph in one pass over the edges. Name: MSA200_trans_offsets.bin; Size: 14,058,588,216 Bytes; SHASUM: 47787ac64fb4485da02e3bcdc1696a814adfdb86
Names (tar.gz)	This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence. Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file. Name: names.tar.gz; Size: 27,130,045,933 Bytes; SHASUM: ba00b58bbb2795445554058a681b573c751ef315
OJSON	The characteristics of the graph and the shasums of the files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser. Name: MSA200.ojson; Size: 897 Bytes; SHASUM: 18e371cbb4bd9dbe6515e4528956ff32fb2e30c4
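
The OJSON row says the file needs a closing brace appended before parsing. A minimal Python sketch of that step follows; `load_ojson` is a hypothetical helper name, and the code assumes the file content is otherwise valid JSON. Dropping a possible trailing comma is a defensive guess, not something this page states.

```python
import json


def load_ojson(path):
    """Parse an 'open JSON' file by appending the missing closing brace."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().rstrip()
    # Defensive assumption: tolerate a trailing comma after the last record.
    text = text.rstrip(",")
    return json.loads(text + "}")
```

The returned dictionary can then be inspected to find the shasum records; the key names are not listed on this page, so none are assumed here.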

Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

In-Degree Distribution
Out-Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Push and Pull Locality
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

MS-BioGraphs

Related Posts

MS-BioGraphs on IEEE DataPort, 17 April 2024
ParaGrapher Source Code For WebGraph Types, 16 February 2024
On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets – BigData’23 (Short Paper), 15 December 2023
Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster), 2 October 2023
MS-BioGraphs: Sequence Similarity Graph Datasets, 30 August 2023
MS-BioGraphs MS, 10 August 2023
MS-BioGraphs MSA500, 10 August 2023
MS-BioGraphs MS200, 10 August 2023
MS-BioGraphs MSA200, 10 August 2023
MS-BioGraphs MS50, 10 August 2023
MS-BioGraphs MSA50, 10 August 2023
MS-BioGraphs MSA10, 10 August 2023
MS-BioGraphs MS1, 10 August 2023
MS-BioGraphs Validation, 10 August 2023

MS-BioGraphs MS50

Posted on 10 August 2023 by Mohsen

Name	MS-BioGraphs – MS50
URL	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MS50
Download Link	https://dx.doi.org/10.21227/gmd9-1534
Script for Downloading All Files	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation	Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted	Yes
Directed	No
Number of Vertices	585,603,088
Number of Edges	124,783,559,600
Maximum Degree	507,826
Minimum Weight	900
Maximum Weight	634,925
Number of Zero-Degree Vertices	0
Average Degree	213.1
Size of The Largest WCC	102,256,631,195
Number of WCC	155,295,301
Creation Details	MS-BioGraphs: Sequence Similarity Graph Datasets
Format	WebGraph
License	CC BY-NC-SA
QUB IDF	2223-052
DOI	10.5281/zenodo.7820819
Citation	Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://dx.doi.org/10.21227/gmd9-1534.
Bibtex	@data{gmd9-1534-24, doi = {10.21227/gmd9-1534}, url = {https://dx.doi.org/10.21227/gmd9-1534}, author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter}, publisher = {IEEE Dataport}, title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets}, year = {2024} }

Files

Underlying Graph	The underlying graph in WebGraph format. File: MS50-underlying.graph, Size: 347,621,279,586 Bytes; File: MS50-underlying.offsets, Size: 1,235,232,971 Bytes; File: MS50-underlying.properties, Size: 1,459 Bytes; Total Size: 348,856,514,016 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)	The weights of the graph in WebGraph format. File: MS50-weights.labels, Size: 324,269,690,037 Bytes; File: MS50-weights.labeloffsets, Size: 1,221,399,047 Bytes; File: MS50-weights.properties, Size: 185 Bytes; Total Size: 325,491,089,269 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)	This file contains the shasums of edge blocks, where each block contains 64 Million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights. The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation. Name: MS50_edges_shas.txt; Size: 223,440 Bytes; SHASUM: 5d1bc449124448e9a6ed3bd439942e31f55d9f97
Offsets (Binary)	The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges. Name: MS50_offsets.bin; Size: 4,684,824,712 Bytes; SHASUM: b298f974167a1c64a8ba8e211a970c5b5d427137
WCC (Binary)	The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array. Name: MS50-wcc.bin; Size: 2,342,412,352 Bytes; SHASUM: 4d640ce445477191a3bc3dd00f09f712b9429af2
Names (tar.gz)	This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence. Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file. Name: names.tar.gz; Size: 27,130,045,933 Bytes; SHASUM: ba00b58bbb2795445554058a681b573c751ef315
N2O Reordering (Binary)	The New to Old (N2O) reordering array of the graph in binary format and little-endian order. It consists of |V| 4-byte elements and identifies the old ID of each vertex, which is used to look up the name of the vertex (protein) in the names.tar.gz file. Name: MS50-n2o.bin; Size: 2,342,412,352 Bytes; SHASUM: 91939605bdde3eb67a013f80d4c2a84d1684ca8f
OJSON	The characteristics of the graph and the shasums of the files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser. Name: MS50.ojson; Size: 751 Bytes; SHASUM: eb94812bea81cd40a3f33d6aaa5fdd63946ffc92
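
The name lookup described in the Names and N2O rows can be sketched in Python as below. The helper names `old_vertex_id` and `find_name` are hypothetical. The code assumes the n2o elements are unsigned 32-bit integers, that the first CSV column is the vertex ID, and that the CSV files have been extracted from names.tar.gz; which of the 120 files holds a given ID is not specified here, so the caller chooses the file to scan.

```python
import csv
import struct


def old_vertex_id(n2o_path, new_id):
    """Read element new_id of a little-endian array of 4-byte integers."""
    with open(n2o_path, "rb") as f:
        f.seek(4 * new_id)
        raw = f.read(4)
    if len(raw) != 4:
        raise IndexError(f"vertex {new_id} is outside the n2o array")
    return struct.unpack("<I", raw)[0]  # assumption: unsigned elements


def find_name(csv_path, old_id):
    """Linear scan of one ';'-separated names file for the given old ID."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter=";"):
            if len(row) >= 2 and row[0].strip() == str(old_id):
                return row[1]
    return None
```

A linear scan is only meant to show the data flow; at this scale a real lookup would more likely build an index once.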

Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

MS-BioGraphs MSA50

Posted on 10 August 2023 by Mohsen

Name	MS-BioGraphs – MSA50
URL	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MSA50
Download Link	https://dx.doi.org/10.21227/gmd9-1534
Script for Downloading All Files	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation	Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted	Yes
Directed	Yes
Number of Vertices	1,757,323,526
Number of Edges	125,312,536,732
Maximum In-Degree	543,117
Maximum Out-Degree	297,981
Minimum Weight	98
Maximum Weight	634,925
Number of Zero In-Degree Vertices	6,437,984
Number of Zero Out-Degree Vertices	8,542,018
Average In-Degree	71.6
Average Out-Degree	71.7
Size of The Largest Weakly Connected Component	117,980,151,055
Number of Weakly Connected Components	363,090,851
Creation Details	MS-BioGraphs: Sequence Similarity Graph Datasets
Format	WebGraph
License	CC BY-NC-SA
QUB IDF	2223-052
DOI	10.5281/zenodo.7820821
Citation	Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://dx.doi.org/10.21227/gmd9-1534.
Bibtex	@data{gmd9-1534-24, doi = {10.21227/gmd9-1534}, url = {https://dx.doi.org/10.21227/gmd9-1534}, author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter}, publisher = {IEEE Dataport}, title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets}, year = {2024} }

Files

Underlying Graph	The underlying graph in WebGraph format. File: MSA50-underlying.graph, Size: 410,094,612,576 Bytes; File: MSA50-underlying.offsets, Size: 3,504,554,221 Bytes; File: MSA50-underlying.properties, Size: 1,493 Bytes; Total Size: 413,599,168,290 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)	The weights of the graph in WebGraph format. File: MSA50-weights.labels, Size: 284,756,409,010 Bytes; File: MSA50-weights.labeloffsets, Size: 3,374,946,996 Bytes; File: MSA50-weights.properties, Size: 186 Bytes; Total Size: 288,131,356,192 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)	This file contains the shasums of edge blocks, where each block contains 64 Million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights. The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation. Name: MSA50_edges_shas.txt; Size: 224,400 Bytes; SHASUM: 6f56a6710ef6b6e7c01e90907f19c7a0099a272c
Offsets (Binary)	The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges. Name: MSA50_offsets.bin; Size: 14,058,588,216 Bytes; SHASUM: 3272fb9c681648598f18ab5a10bbafb5bf48dca5
WCC (Binary)	The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array. Name: MSA50-wcc.bin; Size: 7,029,294,104 Bytes; SHASUM: 82e3ba326bb56c69edbe7fbb90ce70b731e3a7f2
Transposed’s Offsets (Binary)	The offsets array of the transposed graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. It helps to transpose the graph in one pass over the edges. Name: MSA50_trans_offsets.bin; Size: 14,058,588,216 Bytes; SHASUM: 812d75359683dd235a1bd948566b306f43e7088d
Names (tar.gz)	This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence. Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file. Name: names.tar.gz; Size: 27,130,045,933 Bytes; SHASUM: ba00b58bbb2795445554058a681b573c751ef315
OJSON	The characteristics of the graph and the shasums of the files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser. Name: MSA50.ojson; Size: 892 Bytes; SHASUM: 5767cdd2e0cddba1ba255afe9accfdbe5d5aabd2
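
In a CSR-style offsets array, the edges of vertex v occupy positions offsets[v] up to (but excluding) offsets[v+1], so a degree is the difference of two consecutive elements. The Python sketch below reads just those two values from disk. `degree_from_offsets` is a hypothetical helper, the elements are assumed to be unsigned 64-bit integers, and it is an assumption that the plain offsets file gives out-degrees while the transposed one gives in-degrees.

```python
import struct


def degree_from_offsets(offsets_path, v):
    """Return offsets[v+1] - offsets[v] from a little-endian 8-byte array."""
    with open(offsets_path, "rb") as f:
        f.seek(8 * v)
        raw = f.read(16)  # two consecutive 8-byte elements
    if len(raw) != 16:
        raise IndexError(f"vertex {v} is outside the offsets array")
    start, end = struct.unpack("<2Q", raw)  # assumption: unsigned elements
    return end - start
```

Under those assumptions, calling it on MSA50_offsets.bin and on MSA50_trans_offsets.bin with the same vertex would give that vertex's out-degree and in-degree without loading either 14 GB array into memory.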

Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

In-Degree Distribution
Out-Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Push and Pull Locality
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

MS-BioGraphs MSA10

Posted on 10 August 2023 by Mohsen

Name	MS-BioGraphs – MSA10
URL	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MSA10
Download Link	https://dx.doi.org/10.21227/gmd9-1534
Script for Downloading All Files	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation	Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted	Yes
Directed	Yes
Number of Vertices	1,757,323,526
Number of Edges	25,236,632,682
Maximum In-Degree	207,279
Maximum Out-Degree	62,060
Minimum Weight	98
Maximum Weight	634,925
Number of Zero In-Degree Vertices	6,437,984
Number of Zero Out-Degree Vertices	9,926,249
Average In-Degree	14.4
Average Out-Degree	14.4
Size of The Largest Weakly Connected Component	15,576,385,764
Number of Weakly Connected Components	628,505,933
Creation Details	MS-BioGraphs: Sequence Similarity Graph Datasets
Format	WebGraph
License	CC BY-NC-SA
QUB IDF	2223-052
DOI	10.5281/zenodo.7820823
Citation	Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://dx.doi.org/10.21227/gmd9-1534.
Bibtex	@data{gmd9-1534-24, doi = {10.21227/gmd9-1534}, url = {https://dx.doi.org/10.21227/gmd9-1534}, author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter}, publisher = {IEEE Dataport}, title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets}, year = {2024} }

Files

Underlying Graph	The underlying graph in WebGraph format. File: MSA10-underlying.graph, Size: 87,421,101,649 Bytes; File: MSA10-underlying.offsets, Size: 2,743,422,804 Bytes; File: MSA10-underlying.properties, Size: 1,439 Bytes; Total Size: 90,164,525,892 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)	The weights of the graph in WebGraph format. File: MSA10-weights.labels, Size: 58,798,062,287 Bytes; File: MSA10-weights.labeloffsets, Size: 2,731,563,328 Bytes; File: MSA10-weights.properties, Size: 186 Bytes; Total Size: 61,529,625,801 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)	This file contains the shasums of edge blocks, where each block contains 64 Million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights. The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation. Name: MSA10_edges_shas.txt; Size: 45,480 Bytes; SHASUM: 9c42e8ba057c519ae318071e63ab3ffdf992cd50
Offsets (Binary)	The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges. Name: MSA10_offsets.bin; Size: 14,058,588,216 Bytes; SHASUM: b42a8f6aee7c0abdd715f523238ea59acb09c24b
WCC (Binary)	The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array. Name: MSA10-wcc.bin; Size: 7,029,294,104 Bytes; SHASUM: 37f30d638341fa50ae9c73893e7cab689ef14be8
Transposed’s Offsets (Binary)	The offsets array of the transposed graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. It helps to transpose the graph in one pass over the edges. Name: MSA10_trans_offsets.bin; Size: 14,058,588,216 Bytes; SHASUM: 2ae765f6f79b8f41221ba0d869648d01d19bcadd
Names (tar.gz)	This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence. Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file. Name: names.tar.gz; Size: 27,130,045,933 Bytes; SHASUM: ba00b58bbb2795445554058a681b573c751ef315
OJSON	The characteristics of the graph and the shasums of the files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser. Name: MSA10.ojson; Size: 885 Bytes; SHASUM: 0d8c48f9297d36a628aabcd8576cb0c083607534
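
Because vertices in the same component share a value in the WCC array, counting how often each value occurs gives the number of vertices per component. The Python sketch below streams the file in chunks. `component_sizes` is a hypothetical helper and the elements are assumed to be unsigned 32-bit integers. It counts vertices, which need not be the unit used for the largest-component figure in the table above.

```python
import struct
from collections import Counter


def component_sizes(wcc_path, chunk_elems=1 << 20):
    """Count vertices per component label in a little-endian 4-byte array."""
    sizes = Counter()
    with open(wcc_path, "rb") as f:
        while True:
            chunk = f.read(4 * chunk_elems)
            if not chunk:
                break
            # Ignore a trailing partial element if the file is truncated.
            usable = len(chunk) - len(chunk) % 4
            sizes.update(v for (v,) in struct.iter_unpack("<I", chunk[:usable]))
    return sizes
```

`len(sizes)` is then the number of components and `sizes.most_common(1)` the label and vertex count of the largest one. Pure Python is used here for self-containment; for an array of more than a billion elements a vectorised library would be the more practical choice.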

Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

In-Degree Distribution
Out-Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Push and Pull Locality
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

MS-BioGraphs MS1

Posted on 10 August 2023 by Mohsen

Name	MS-BioGraphs – MS1
URL	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MS1
Download Link	https://dx.doi.org/10.21227/gmd9-1534
Script for Downloading All Files	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-on-IEEE-DataPort/
Validating and Sample Code	https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/
Graph Explanation	Vertices represent proteins and each edge represents the sequence similarity between its two endpoints
Edge Weighted	Yes
Directed	No
Number of Vertices	43,144,218
Number of Edges	2,660,495,200
Maximum Degree	14,212
Minimum Weight	3,680
Maximum Weight	634,925
Number of Zero-Degree Vertices	0
Average Degree	61.7
Size of The Largest WCC	124,003,393
Number of WCC	15,746,208
Creation Details	MS-BioGraphs: Sequence Similarity Graph Datasets
Format	WebGraph
License	CC BY-NC-SA
QUB IDF	2223-052
DOI	10.5281/zenodo.7820827
Citation	Mohsen Koohi Esfahani, Sebastiano Vigna, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, March 13, 2024, "MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets", IEEE Dataport, doi: https://dx.doi.org/10.21227/gmd9-1534.
Bibtex	@data{gmd9-1534-24, doi = {10.21227/gmd9-1534}, url = {https://dx.doi.org/10.21227/gmd9-1534}, author = {Koohi Esfahani, Mohsen and Vigna, Sebastiano and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter}, publisher = {IEEE Dataport}, title = {MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets}, year = {2024} }

Files

Underlying Graph	The underlying graph in WebGraph format. File: MS1-underlying.graph, Size: 6,300,911,484 Bytes; File: MS1-underlying.offsets, Size: 77,574,569 Bytes; File: MS1-underlying.properties, Size: 1,288 Bytes; Total Size: 6,378,487,341 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Weights (Labels)	The weights of the graph in WebGraph format. File: MS1-weights.labels, Size: 8,201,441,365 Bytes; File: MS1-weights.labeloffsets, Size: 80,797,007 Bytes; File: MS1-weights.properties, Size: 184 Bytes; Total Size: 8,282,238,556 Bytes. These files are validated using the ‘Edge Blocks SHAs File’ described below.
Edge Blocks SHAs File (Text)	This file contains the shasums of edge blocks, where each block contains 64 Million consecutive edges and has one shasum for its 64M endpoints and one for its 64M edge weights. The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation. Name: MS1_edges_shas.txt; Size: 5,040 Bytes; SHASUM: 27974edb4bf8f3b17b00ff3a72a703da18f3807a
Offsets (Binary)	The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively. This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges. Name: MS1_offsets.bin; Size: 345,153,752 Bytes; SHASUM: 0abedde32e1ac7181897f82d10d40acfe14f2022
WCC (Binary)	The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array. Name: MS1-wcc.bin; Size: 172,576,872 Bytes; SHASUM: 4c491dd96e3582b70a203ae4a910001381278d75
Names (tar.gz)	This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: the ID of the vertex and the name of the sequence. Note: If the graph has an ‘N2O Reordering’ file, the n2o array should be used to convert each vertex ID to the old vertex ID, which identifies the name of the protein in the `names.tar.gz` file. Name: names.tar.gz; Size: 27,130,045,933 Bytes; SHASUM: ba00b58bbb2795445554058a681b573c751ef315
N2O Reordering (Binary)	The New to Old (N2O) reordering array of the graph in binary format and little-endian order. It consists of |V| 4-byte elements and identifies the old ID of each vertex, which is used to look up the name of the vertex (protein) in the names.tar.gz file. Name: MS1-n2o.bin; Size: 172,576,872 Bytes; SHASUM: b163320b6349fed7a00fb17c4a4a22e7d124b716
OJSON	The characteristics of the graph and the shasums of the files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser. Name: MS1.ojson; Size: 736 Bytes; SHASUM: c60afa0652955fd46f1bb8056380523504d69fa6
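
The element counts given above fix the sizes of the binary arrays, which makes a quick check for truncated downloads possible before computing any shasum. A small Python sketch follows, using the MS1 vertex count from the table; `expected_sizes` is a hypothetical helper name.

```python
def expected_sizes(num_vertices):
    """Byte sizes implied by the element counts and widths stated above."""
    return {
        "offsets": 8 * (num_vertices + 1),  # |V|+1 elements of 8 bytes
        "wcc": 4 * num_vertices,            # |V| elements of 4 bytes
        "n2o": 4 * num_vertices,            # |V| elements of 4 bytes
    }


sizes = expected_sizes(43_144_218)  # |V| of MS1
print(sizes["offsets"], sizes["wcc"], sizes["n2o"])
# prints "345153752 172576872 172576872"
```

These values match the sizes listed for MS1_offsets.bin, MS1-wcc.bin, and MS1-n2o.bin.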

Plots

For an explanation of the plots, please refer to the MS-BioGraphs paper.

Degree Distribution
Weight Distribution
Vertex-Relative Weight Distribution
Degree Decomposition
Cell-Binned Average Weight Degree Distribution
Weakly-Connected Components Size Distribution

MS-BioGraphs Validation

Posted on 10 August 2023 by Mohsen

Repository

https://github.com/DIPSA-QUB/MS-BioGraphs-Validation

Explanation

We provide a shell script, validation.sh, and a Java program, EdgeBlockSHA.java, to verify the correctness of the graphs. Each graph has a .ojson file whose shasum is verified against the value retrieved from our server. Files such as offsets.bin, wcc.bin, n2o.bin, trans_offsets.bin, and edges_shas.txt have shasum records in the ojson file, which are used to validate these files.
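
The per-file check described above amounts to hashing a downloaded file and comparing the result with a recorded value. The following Python sketch is not the validation.sh script itself. It assumes the shasums are SHA-1 digests, which is consistent with their 40 hexadecimal digits and with the default of the `shasum` tool; `sha1_of_file` and `matches` are hypothetical helper names.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file without loading it whole."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a recorded shasum (case-insensitive)."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

The expected value is passed in by the caller, for example copied from the SHASUM entries on the dataset pages, since the key names used inside the ojson file are not listed here.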

The graph in WebGraph format is stored in compressed form in the MS??-underlying.* and MS??-weights.* files. To validate the compressed graph, EdgeBlockSHA.java is used. It is a parallel Java program that uses the WebGraph library to traverse the graph and calculate the shasum of each block of edges (endpoints and weights). The calculated results are then matched against the edges_shas.txt file of the graph.

It is also possible to validate particular blocks by matching the calculated shasum with the relevant row in the edges_shas.txt file, whose format is shown below. Each block contains 64 Million consecutive edges. The start of each block is identified by a vertex ID and its edge index. The endpoint_sha column is the shasum of the 64 Million endpoints when stored as an array of 4-byte elements in binary format and little-endian order. Similarly, the weights_sha column is the shasum of the weights (labels). Weights are kept separate from endpoints because some applications do not need them, in which case there is no need to read and validate them.

64MB blk#;     vertex; edge index;                             endpoint_sha;                              weights_sha;
         0;          0;          0; 509784b158cb9404241afb21d0ceaf590b88d2f2; 57da4ad7bb89c5922e436b0535d791fa8f40dffd;
         1;    2315113;        705; fafc118563c1d7b5fbff64af56edd6a56524f479; 13b7a9ca60bfb0715d563218d0a1cd787b00a07c;
         2;    4521625;        597; 4ed65aa07c8062a151166ef2e9bdb93e41d19357; 8158276bec426ee46eca9912759eb9bd57fcc957;
         3;    6347361;        112; d02e8913c807c3f4ecde9c638e0ded5ab80ba819; 26bc3296de65cba6ac539cd96b79ae6f7a4d37be;
         4;    8447869;         15; 61513c84db40124496cdf769516118b63598914f; 781b9f4372ac614e94d097017c756d015234deb6;
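
To make the row format concrete, the Python sketch below parses rows like those above and hashes a list of endpoints the way the endpoint_sha column is described: as 4-byte little-endian elements. It is not the EdgeBlockSHA.java program. The helper names are hypothetical, SHA-1 is assumed from the 40-digit values, and the endpoints are assumed to be unsigned.

```python
import hashlib
import struct


def parse_edges_shas(lines):
    """Turn ';'-separated rows into dicts, skipping the header and blanks."""
    rows = []
    for line in lines:
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header row or empty line
        rows.append({
            "block": int(fields[0]),
            "vertex": int(fields[1]),
            "edge_index": int(fields[2]),
            "endpoint_sha": fields[3],
            "weights_sha": fields[4],
        })
    return rows


def endpoints_sha(endpoints):
    """SHA-1 of the endpoints packed as little-endian 4-byte integers."""
    packed = struct.pack(f"<{len(endpoints)}I", *endpoints)
    return hashlib.sha1(packed).hexdigest()
```

To check one block, the endpoints of its edges, starting at the row's vertex and edge index, would be collected with a graph reader and the result of `endpoints_sha` compared with that row's `endpoint_sha`.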

Requirements

JDK with version > 15
jq
wget

WebGraph Framework

Please visit https://webgraph.di.unimi.it .

ParaGrapher Graph Loading API and Library

The WebGraph formats can also be read using the ParaGrapher library: https://blogs.qub.ac.uk/DIPSA/ParaGrapher/.

License

Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied.

Copyright 2022-2023 The Queen’s University of Belfast, Northern Ireland, UK

On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)

Posted on 8 December 2022 by Mohsen

Mohsen Koohi Esfahani
Supervisors: Hans Vandierendonck and Peter Kilpatrick

Thesis in PDF format
Thesis on QUB Pure Portal

Graph algorithms find several usages in industry, science, humanities, and technology. The fast-growing size of graph datasets in the context of the processing model of the current hardware has resulted in different bottlenecks such as memory locality, work-efficiency, and load-balance that degrade the performance. To tackle these limitations, high-performance computing considers different aspects of the execution in order to design optimized algorithms through efficient usage of hardware resources.

The main idea in this thesis is to analyze the structure of graphs in order to exploit special features that are key to introducing new graph algorithms with optimized performance.

First, we study the structure of real-world graph datasets with skewed degree distribution and the applicability of graph relabeling algorithms as the main restructuring tools to improve performance and memory locality. To that end, we introduce novel locality metrics including Cache Miss Rate Degree Distribution, Effective Cache Size, Push Locality and Pull Locality, and Degree Range Decomposition.

Based on this structural analysis, we introduce the Uniform Memory Demands strategy that (i) recognizes diverse memory demands and behaviours as a source of performance inefficiency, (ii) separates contrasting memory demands into groups with uniform behaviours across each group, and (iii) designs bespoke data structures and algorithms for each group in order to satisfy memory demands with the lowest overhead.

We apply the Uniform Memory Demands strategy to design three graph algorithms with optimized performance: (i) the SAPCo Sort algorithm, a parallel counting sort algorithm that is faster than comparison-based sorting algorithms in degree-ordering of power-law graphs, (ii) the iHTL algorithm, which optimizes locality in Sparse Matrix-Vector (SpMV) Multiplication graph algorithms by extracting dense subgraphs containing incoming edges to in-hubs and processing them in the push direction, and (iii) the LOTUS algorithm, which optimizes locality in Triangle Counting by separating different caching demands and deploying a specific data structure and algorithm for each of them.

Bibtex

@phdthesis{ODSAGA-ethos.874822,
  title  = {On Designing Structure-Aware High-Performance Graph Algorithms},
  author = {Mohsen Koohi Esfahani},
  year   = 2022,
  url    = {https://blogs.qub.ac.uk/DIPSA/On-Designing-Structure-Aware-High-Performance-Graph-Algorithms-PhD-Thesis/},
  school = {Queen's University Belfast},
  EThOSID = {uk.bl.ethos.874822}
}

ParaGrapher Integrated to LaganLighter, 16 February 2024
On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis), 8 December 2022
LaganLighter Source Code, 14 November 2022
MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’22, 28 June 2022
SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster), 23 May 2022
LOTUS: Locality Optimizing Triangle Counting – PPOPP’22, 5 April 2022
Locality Analysis of Graph Reordering Algorithms – IISWC’21, 8 November 2021
Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’21, 9 September 2021
Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’21, 9 August 2021
How Do Graph Relabeling Algorithms Improve Memory Locality? ISPASS’21 (Poster), 28 March 2021

LaganLighter

LaganLighter Source Code

Posted on 14 November 2022 by Mohsen

Repository

https://github.com/DIPSA-QUB/LaganLighter

Algorithms in This Repo

SAPCo Sort: alg1_sapco_sort
Thrifty Label Propagation Connected Components: alg2_thrifty
MASTIFF: Structure-Aware Minimum Spanning Tree/Forest: alg3_mastiff
iHTL: in-Hub Temporal Locality in SpMV (Sparse-Matrix Vector Multiplication) based Graph Processing: (to be added)
LOTUS: Locality Optimizing Triangle Counting: (to be added)

Cloning

git clone https://github.com/DIPSA-QUB/LaganLighter.git --recursive

Graph Types

LaganLighter supports the following graph formats:

CSR/CSC graph in text format, for testing. This format has 4 lines: (i) number of vertices (|V|), (ii) number of edges (|E|), (iii) |V| space-separated numbers showing offsets of the vertices, and (iv) |E| space-separated numbers indicating edges.
CSR WebGraph format: supported by the Poplar Graph Loading Library
external git repository
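
The 4-line text format described above can be read with a few lines of C. The following is a minimal sketch under stated assumptions, not the loader used in LaganLighter: text_csr and read_text_csr are hypothetical names, and the sketch stores one extra offset (equal to |E|) so that the neighbours of vertex v are always edges[offsets[v]] up to edges[offsets[v + 1] - 1].

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* In-memory CSR/CSC graph (illustrative layout). */
struct text_csr {
    unsigned long vertices_count;
    unsigned long edges_count;
    unsigned long* offsets;  /* vertices_count + 1 entries; the last is edges_count */
    unsigned long* edges;    /* edges_count neighbour IDs */
};

/* Read the 4-line text format: |V|, |E|, |V| offsets, |E| edges.
   Returns 0 on success and -1 on malformed input or allocation failure. */
int read_text_csr(FILE* in, struct text_csr* g)
{
    assert(in != NULL && g != NULL);
    g->offsets = NULL;
    g->edges = NULL;

    if (fscanf(in, "%lu %lu", &g->vertices_count, &g->edges_count) != 2)
        return -1;

    g->offsets = malloc((g->vertices_count + 1) * sizeof *g->offsets);
    /* Allocate at least one element so that an empty edge list is still valid. */
    g->edges = malloc((g->edges_count ? g->edges_count : 1) * sizeof *g->edges);
    if (g->offsets == NULL || g->edges == NULL)
        goto fail;

    for (unsigned long v = 0; v < g->vertices_count; v++)
        if (fscanf(in, "%lu", &g->offsets[v]) != 1)
            goto fail;
    /* Sentinel offset, added by this sketch and not present in the file. */
    g->offsets[g->vertices_count] = g->edges_count;

    for (unsigned long e = 0; e < g->edges_count; e++)
        if (fscanf(in, "%lu", &g->edges[e]) != 1)
            goto fail;

    return 0;

fail:
    free(g->offsets);
    free(g->edges);
    g->offsets = NULL;
    g->edges = NULL;
    return -1;
}
```

For instance, a file with the four lines "3", "4", "0 2 3" and "1 2 2 0" describes a graph with 3 vertices and 4 edges in which vertex 0 has neighbours 1 and 2, vertex 1 has neighbour 2, and vertex 2 has neighbour 0.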

Measurements

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions (see the papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, the PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it together with the total time (mt).

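mt = - get_nano_time();   // start of the total (wall-clock) time of the loop
#pragma omp parallel
{
   unsigned tid = omp_get_thread_num();
   ttimes[tid] = - get_nano_time();   // start of this thread's time

   // nowait: a thread stops its timer as soon as it runs out of work
   #pragma omp for nowait
   for(unsigned int v = 0; v < g->vertices_count; v++)
   {
      // .....
   }
   ttimes[tid] += get_nano_time();   // time this thread spent in the loop
}
mt += get_nano_time();   // total time, including waiting for the slowest thread
PTIP("Step ... ");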

As an example, in one execution of Thrifty, the “Zero Planting” step was performed in 8.98 milliseconds with an 8.22% load imbalance, while in the “Initial Push” step processors were idle for 72.22% of the execution time, on average.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

To assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially divide the work into partitions and assign a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process. During the initialization of the LaganLighter parallel processing environment (in the initialize_omp_par_env() function defined in file omp.c), we create, for each thread, an ordered list of the threads it steals from.

A thread first steals jobs (i.e., partitions) from the subsequent threads in the same NUMA node and then from the threads in the subsequent NUMA nodes. As an example, on a 24-core machine with 2 NUMA nodes, thread 1 steals from threads 2, 3, …, 11, and 0 running on the same NUMA socket and then from threads 13, 14, …, 23, and 12 running on the next NUMA socket.

We use the dynamic_partitioning_...() functions (in file partitioning.c) so that threads process partitions in the specified order. Sample code follows:

struct dynamic_partitioning* dp = dynamic_partitioning_initialize(pe, partitions_count);

#pragma omp parallel
{
   unsigned int tid = omp_get_thread_num();
   unsigned int partition = -1U;

   while(1)
   {
      // Ask for the next partition of this thread, passing the previous one
      partition = dynamic_partitioning_get_next_partition(dp, tid, partition);
      if(partition == -1U)
         break;

      // Process the vertices of the received partition
      for(unsigned int v = start_vertex[partition]; v < start_vertex[partition + 1]; v++)
      {
         // ....
      }
   }
}

dynamic_partitioning_reset(dp);

Bugs & Support

As “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you receive wrong results or you are suspicious about parts of the code, please contact us or submit an issue.

License

Approximate Maximum Weighted Clique

Posted on 25 August 2022 by Qasim Abbas

This project aims to develop novel algorithms for the maximum weighted clique (MWC) problem, which appears in various data analysis pipelines in precision medicine. The MWC problem is NP-hard in nature, which makes it particularly challenging given the exponentially increasing amount of data it is applied to.

Although several attempts have been made to solve the maximum weighted clique problem in large graphs, there is still much opportunity for lowering the execution time necessary to find a satisfactory solution. In this project in particular we are investigating approximate algorithms for the MWC problem. We are working towards an algorithm that achieves a very high quality solution (i.e., finding a clique with weight very close to the MWC) in polynomial time.

IBM will provide industrially relevant context on knowledge extraction from graph-structured data. They have extensive experience in this area by building scalable software systems for the analysis of massive-scale graph data. They will moreover provide access to relevant datasets.

Project Members

Funding

This PhD project is funded by the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie Actions.

CITI-GENS MSCA CO-FUND

MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’22

Posted on 28 June 2022 by Mohsen

36th ACM International Conference on Supercomputing 2022
June 27-30, 2022
Acceptance Rate: 25%

DOI: 10.1145/3524059.3532365
Authors’ Copy (PDF Format)

The Minimum Spanning Forest (MSF) problem finds usage in many different applications. While theoretical analysis shows that linear-time solutions exist, in practice, parallel MSF algorithms remain computationally demanding due to the continuously increasing size of data sets.

In this paper, we study the MSF algorithm from the perspective of graph structure and investigate the implications of the power-law degree distribution of real-world graphs on this algorithm.

We introduce the MASTIFF algorithm as a structure-aware MSF algorithm that optimizes work efficiency by (1) dynamically tracking the largest forest component of each graph component and exempting it from processing, and (2) avoiding topology-related operations such as relabeling and merging neighbour lists.

The evaluation on 2 different processor architectures with up to 128 cores and on graphs of up to 124 billion edges shows that MASTIFF is 3.4–5.9× faster than previous works.

Code Availability
The source code of MASTIFF is available in the LaganLighter repository (files alg3_mastiff.c and msf.c). A sample execution of this source code for the “Twitter-MPI” graph is shown in the following:

BibTex

@INPROCEEDINGS{10.1145/3524059.3532365,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
title = {{MASTIFF}: Structure-Aware Minimum Spanning Tree/Forest},
year = {2022},
isbn = {},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3524059.3532365},
doi = {10.1145/3524059.3532365},
booktitle = {Proceedings of the 36th ACM International Conference on Supercomputing},
numpages = {13}
}
