{"id":2309,"date":"2023-08-10T10:02:53","date_gmt":"2023-08-10T09:02:53","guid":{"rendered":"https:\/\/blogs.qub.ac.uk\/dipsa\/?page_id=2309"},"modified":"2025-02-28T04:41:22","modified_gmt":"2025-02-28T04:41:22","slug":"ms-biographs","status":"publish","type":"page","link":"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs\/","title":{"rendered":"MS-BioGraphs: Trillion-Scale Public Real-World Sequence Similarity Graph Datasets"},"content":{"rendered":"\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>Project Statement<\/strong><\/h2>\n\n\n\n<p class=\"has-text-align-justify\">Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly accessible, relevant, and realistic data sets. <br>We (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family of publicly available real-world graph datasets with up to 2.5 trillion edges, that is, 6.6 times larger than the largest graph published recently. <br>The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. 
The MS-BioGraphs family also includes seven subgraphs of different sizes and direction types.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>Project Steps<\/strong><\/h2>\n\n\n\n<p class=\"has-text-align-justify\">(i) Creating and validating the MS-BioGraphs by designing and engineering parallel and distributed algorithms and data structures to optimize performance and cluster utilization,<br>(ii) Extending the <a href=\"https:\/\/webgraph.di.unimi.it\/\" target=\"_blank\" rel=\"noreferrer noopener\">WebGraph<\/a> framework with parallel compression for MS-BioGraphs as edge-weighted graphs, and <br>(iii) Analyzing the structural characteristics of MS-BioGraphs by comparing them with other real-world graphs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\"><strong>Datasets on IEEE DataPort<\/strong><\/h2>\n\n\n\n<p>DOI: <a href=\"https:\/\/doi.org\/10.21227\/gmd9-1534\">https:\/\/doi.org\/10.21227\/gmd9-1534<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\"><strong>Validation &amp; Sample Code<\/strong><\/h2>\n\n\n\n<p>Please visit the <a href=\"https:\/\/blogs.qub.ac.uk\/DIPSA\/MS-BioGraphs-Validation\/\" data-type=\"link\" data-id=\"https:\/\/blogs.qub.ac.uk\/DIPSA\/MS-BioGraphs-Validation\/\" target=\"_blank\" rel=\"noreferrer noopener\">MS-BioGraphs Validation<\/a> post.<br>The ParaGrapher library (<a href=\"https:\/\/blogs.qub.ac.uk\/DIPSA\/ParaGrapher\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/blogs.qub.ac.uk\/DIPSA\/ParaGrapher\/<\/a>) may be used to access MS-BioGraphs in C\/C++.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>Datasets, Source Code, and Publications<\/strong><\/h2>\n\n\n<ul class=\"wp-block-latest-posts__list has-dates wp-block-latest-posts\"><li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img decoding=\"async\" width=\"150\" height=\"150\" 
src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2024\/08\/trees-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/minimum-spanning-forest-of-ms-biographs\/\">Minimum Spanning Forest of MS-BioGraphs<\/a><time datetime=\"2024-08-09T14:11:36+01:00\" class=\"wp-block-latest-posts__post-date\">9 August 2024<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2024\/04\/ivy-2-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-on-ieee-dataport\/\">MS-BioGraphs on IEEE DataPort<\/a><time datetime=\"2024-04-17T07:26:23+01:00\" class=\"wp-block-latest-posts__post-date\">17 April 2024<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2024\/02\/poplar2-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/paragrapher-source-code-for-webgraph-types\/\">ParaGrapher Source Code For WebGraph Types<\/a><time datetime=\"2024-02-16T08:13:13+00:00\" class=\"wp-block-latest-posts__post-date\">16 February 2024<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" 
src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/11\/goldcrest-1-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/on-overcoming-hpc-challenges-of-trillion-scale-real-world-graph-datasets\/\">On Overcoming HPC Challenges of  Trillion-Scale Real-World Graph Datasets \u2013 BigData&#8217;23 (Short Paper)<\/a><time datetime=\"2023-12-15T02:47:00+00:00\" class=\"wp-block-latest-posts__post-date\">15 December 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/10-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/dataset-announcement-ms-biographs-trillion-scale-public-real-world-sequence-similarity-graphs\/\">Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs &#8211; IISWC&#8217;23 (Poster)<\/a><time datetime=\"2023-10-02T00:26:00+01:00\" class=\"wp-block-latest-posts__post-date\">2 October 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/2-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-sequence-similarity-graph-datasets\/\">MS-BioGraphs: Sequence Similarity Graph Datasets<\/a><time 
datetime=\"2023-08-30T06:52:00+01:00\" class=\"wp-block-latest-posts__post-date\">30 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/1-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-ms\/\">MS-BioGraphs MS<\/a><time datetime=\"2023-08-10T09:53:42+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/6-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-msa500\/\">MS-BioGraphs MSA500<\/a><time datetime=\"2023-08-10T09:52:00+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/3-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-ms200\/\">MS-BioGraphs MS200<\/a><time datetime=\"2023-08-10T09:51:00+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image 
alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/7-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-msa200\/\">MS-BioGraphs MSA200<\/a><time datetime=\"2023-08-10T09:50:00+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/4-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-ms50\/\">MS-BioGraphs MS50<\/a><time datetime=\"2023-08-10T09:49:00+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/8-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-msa50\/\">MS-BioGraphs MSA50<\/a><time datetime=\"2023-08-10T09:48:00+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" 
src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/9-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-msa10\/\">MS-BioGraphs MSA10<\/a><time datetime=\"2023-08-10T09:44:41+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/5-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-ms1\/\">MS-BioGraphs MS1<\/a><time datetime=\"2023-08-10T09:41:14+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<li><div class=\"wp-block-latest-posts__featured-image alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-content\/uploads\/sites\/14\/2023\/08\/11-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" style=\"max-width:60px;max-height:60px;\" \/><\/div><a class=\"wp-block-latest-posts__post-title\" href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/ms-biographs-validation\/\">MS-BioGraphs Validation<\/a><time datetime=\"2023-08-10T09:40:00+01:00\" class=\"wp-block-latest-posts__post-date\">10 August 2023<\/time><\/li>\n<\/ul>\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\"><br><strong>Project Members<\/strong><\/h2>\n\n\n\n<p>&#8211; <a href=\"https:\/\/orcid.org\/0000-0002-7465-8003\" target=\"_blank\" rel=\"noreferrer noopener\">Mohsen Koohi Esfahani<\/a><br>&#8211; <a 
href=\"https:\/\/vigna.di.unimi.it\/\" target=\"_blank\" rel=\"noreferrer noopener\">Sebastiano Vigna, Universit\u00e0 degli Studi di Milano<\/a><br>&#8211; <a href=\"https:\/\/boldi.di.unimi.it\/\" target=\"_blank\" rel=\"noreferrer noopener\">Paolo Boldi, Universit\u00e0 degli Studi di Milano<\/a><br>&#8211; <a href=\"https:\/\/blogs.qub.ac.uk\/dipsa\/personal-page-hans-vandierendonck\/\">Hans Vandierendonck<\/a><br>&#8211; <a href=\"https:\/\/www.cs.qub.ac.uk\/~p.kilpatrick\/\" target=\"_blank\" rel=\"noreferrer noopener\">Peter Kilpatrick<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>Naming<\/strong><\/h2>\n\n\n\n<p class=\"has-text-align-justify\">The name of each graph starts with the two characters <em>M<\/em> and <em>S<\/em>, the initials of <a href=\"http:\/\/Metaclust.mmseqs.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Metaclust<\/a> (the source dataset) and Sequence similarity (the real-world domain of the graph), respectively. The names of the directed subgraphs have a third character, <em>A<\/em>, which indicates that the graph is asymmetric. 
The names of the subgraphs end with up to <em>3 digits<\/em> that give the size of the subgraph relative to the MS graph, multiplied by a thousand.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>Grants and Funding<\/strong><\/h2>\n\n\n\n<p class=\"has-text-align-justify\">&#8211; Kelvin-2 supercomputer (UKRI EPSRC grant EP\/T022175\/1)<br>&#8211; PhD scholarship from The Department for the Economy, Northern Ireland and QUB<br>&#8211; Energy Efficient Transprecision Techniques for Linear system Solvers<br>&#8211; SERICS project (PE00000014) under the NRRP MUR program funded by the EU &#8211; NGEU<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>License<\/strong><\/h2>\n\n\n\n<p class=\"has-text-align-justify\">The datasets are published under the <a href=\"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/2.0\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>CC BY-NC-SA<\/strong><\/a> license. <br>QUB IDF: 2223-052<\/p>\n\n\n\n<p>Last update: Sep. 
13th, 2024<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-justify has-medium-font-size\"><strong>Acknowledgements<\/strong><\/h2>\n\n\n\n<p class=\"has-text-align-justify\">We are grateful to <br>&#8211; <a href=\"https:\/\/ieee-dataport.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">IEEE DataPort<\/a><br>&#8211; Sean McKeever, Head of IT, EEECS, QUB and his team<br>&#8211; Ian Overton, School of Medicine, Dentistry and Biomedical Sciences, QUB<br>&#8211; Vaughan Purnell, Head of NI-HPC and his team<br>&#8211; Jesus Martinez-del-Rincon and SPRC committee, EEECS, QUB<br>&#8211; Martin Frith, University of Tokyo<br>&#8211; Ariful Azad, Indiana University<br>&#8211; Eurcom, France<br>&#8211; Unsplash, Pixabay, and Plotly<br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Project Statement Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly accessible, relevant, and realistic data sets. 
We (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family [&hellip;]<\/p>\n","protected":false},"author":1315,"featured_media":2082,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2309","page","type-page","status-publish","has-post-thumbnail","czr-hentry"],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/pages\/2309","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/users\/1315"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/comments?post=2309"}],"version-history":[{"count":84,"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/pages\/2309\/revisions"}],"predecessor-version":[{"id":3430,"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/pages\/2309\/revisions\/3430"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/media\/2082"}],"wp:attachment":[{"href":"https:\/\/blogs.qub.ac.uk\/dipsa\/wp-json\/wp\/v2\/media?parent=2309"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}