An Incomplete List of Publicly Available Graph Datasets/Generators

Real-World Graphs

Synthetic Graph Generators

Technical Posts

QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique – Euro-Par 2024

30th International European Conference on Parallel and Distributed Computing (Euro-Par 2024)

DOI: TBC
PDF Version

Abstract

The Maximum Weighted Clique(MWC) problem remains challenging due to its unfavourable time complexity.In this paper, we analyze the execution of exact search-based MWC algorithms and show that high-accuracy weighted cliques can be discovered in the early stages of the execution if searching the combinatorial space is performed systematically.

Based on this observation, we introduce QClique as an approximate MWC algorithm that processes the search space as long as better cliques are expected. QClique uses a tunable parameter to trade-off between accuracy vs. execution time and delivers 4.7-$82.3 time speedup in comparison to previous state-of-the-art MWC algorithms while providing 91.4% accuracy and achieves a parallel speedup of up to 56x on 128 threads.

Additionally, QClique accelerates the exact MWC computation by replacing the initial clique of the exact algorithm. For WLMC, an exact state-of-the-art MWC algorithm, this results in 3.3x on average.

Code

https://github.com/DIPSA-QUB/QClique

Selective Parallel Loading of Large-Scale Compressed Graphs with ParaGrapher – arXiv Version

PDF version
DOI: 10.48550/arXiv.2404.19735

Comprehensive evaluation is one of the basis of experimental science. In High-Performance Graph Processing, a thorough evaluation of contributions becomes more achievable by supporting common input formats over different frameworks. However, each framework creates its specific format, which may not support reading large-scale real-world graph datasets. This shows a demand for high-performance libraries capable of loading graphs to (i) accelerate designing new graph algorithms, (ii) to evaluate the contributions on a wide range of graph algorithms, and (iii) to facilitate easy and fast comparison over different graph frameworks.

To that end, we present ParaGrapher, a high-performance API and library for loading large-scale and compressed graphs. ParaGrapher supports different types of requests for accessing graphs in shared- and distributed-memory and out-of-core graph processing. We explain the design of ParaGrapher and present a performance model of graph decompression, which is used for evaluation of ParaGrapher over three storage types.

Our evaluation shows that by decompressing compressed graphs in WebGraph format, ParaGrapher delivers up to 3.2 times speedup in loading and up to 5.2 times speedup in end-to-end execution in comparison to the binary and textual formats.

ParaGrapher is available online on https://blogs.qub.ac.uk/DIPSA/ParaGrapher/.

BibTex

@misc{paragrapher-arxiv,
  title = { Selective Parallel Loading of Large-Scale 
            Compressed Graphs with ParaGrapher}, 
  author = { {Mohsen} {Koohi Esfahani} and Marco D'Antonio and 
             Syed Ibtisam Tauhidi and Thai Son Mai and 
             Hans Vandierendonck},
  year = {2024},
  eprint = {2404.19735},
  archivePrefix = {arXiv},
  primaryClass = {cs.AR},
  doi = {10.48550/arXiv.2404.19735}
}

Related Posts & Source Code

ParaGrapher Web Page

An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)

Short URL of this post: https://blogs.qub.ac.uk/DIPSA/HDD-vs-SSD-vs-LustreFS-2024

We evaluate read bandwidth of three storage types:

  • HDD: A 6TB Hitachi HUS726060AL 7200RPM SATA v3.1
  • SSD: A 4TB Samsung MZQL23T8HCLS-00A07 PCIe4 NVMe v1.4
  • LustreFS: A parallel file system with total 2PB with a SSD pool

and for three parallel read methods:

and for two block sizes:

  • 4 KB blocks
  • 4 MB blocks

The source code is available on ParaGrapher repository:

The OS cache of storage contents have been dropped after each evaluation
(sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches').
The flushcache.c file (https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/flushcache.c) can be used with the same functionality for users without sudo access, however, it usually takes more time to be finished.

For LustreFS, we have repeated the evaluation of read and pread using O_DIRECT flag as this flag prevents client-side caching.

For HDD and SSD experiments, we have used a machine with Intel W-2295 3.00GHz CPU, 18 cores, 36 hyper-threads, 24MB L3 cache, 256 GB DDR4 2933Mhz memory, running Debian 12 Linux 6.1. For LustreFS, we have used a machine with 2TB 3.2GHz DDR4 memory, 2 AMD 7702 CPUs, in total, 128 cores, 256 threads.

The results of the evaluation using read_bandwidth.c are in the following table. The values are Bandwidth in MB/s. Also, 1-2 digits close to each number with a white background are are percentage of load imbalance between parallel threads.

Please click on the image to expand.

For similar comparisons you may refer to:
https://github.com/david-slatinek/c-read-vs.-mmap/tree/main
https://eklausmeier.goip.de/blog/2016/02-03-performance-comparison-mmap-versus-read-versus-fread/

Technical Posts


ParaGrapher

MS-BioGraphs on IEEE DataPort

MS-BioGraph sequence similarity graph datasets are now publicly available on IEEE DataPort: https://doi.org/10.21227/gmd9-1534.

To access the files, you need to register/login to IEEE DataPort and then visit the MS-BioGraphs page. By saving the page as an HTML file such as dp.html, you may download the datasets (as an example MS1) using the following script:

dsname="MS1"
html_file="dp.html"

urls=`cat $html_file  | sed  -e 's/\&/\&/g'  | grep -Eo "(http|https)://[a-zA-Z0-9./?&=_%:-]*" | grep amazonaws  | sort | uniq | grep -E "$dsname[-_\.]"`

for u in $urls; do
    wget $u
    if [ $? != 0 ]; then break; fi
done

# removing query strings
for f in $(find $1 -type f); do
    if [ $f = ${f%%\?*} ]; then continue; fi
    mv "${f}" "${f%%\?*}"
done

# liking offsets.bin to be found by ParaGrapher
ln -s ${dsname}_offsets.bin ${dsname}-underlying_offsets.bin

Instead of wget you may use axel -n 10 to use multiple connections (here, 10) for downloading each file (https://manpages.ubuntu.com/manpages/noble/en/man1/axel.1.html).

MS-BioGraphs

Related Posts

ParaGrapher Integrated to LaganLighter

Poplar source code has been integrated to LaganLighter and access to different WebGraph formats are available in LaganLighter:

  • PARAGRAPHER_CSX_WG_400_AP
  • PARAGRAPHER_CSX_WG_404_AP
  • PARAGRAPHER_CSX_WG_800_AP

For further details, please refer to
– LaganLighter source coder Repository: https://github.com/DIPSA-QUB/LaganLighter, particularly, the graph.c file.
– ParaGrapher source code repository: https://github.com/DIPSA-QUB/ParaGrapher particularly, the src/webgraph.c and src/WG*.java files.

Read more about ParaGrapher and LaganLighter.

Related Posts

ParaGrapher Source Code For WebGraph Types

ParaGrapher source code for accessing WebGraphs have been published. The supported graph types are:

ParaGrapher uses its asynchronous and parallel API to implement these graph types. The user needs to implement a callback function that is called by the API upon completion of reading a block of edges. Poplar uses a shared memory for interaction between its C library and the Java library that deploys the WebGraph framework.

For further details, please refer to Poplar source code repository: https://github.com/DIPSA-QUB/ParaGrapher, particularly, src/webgraph.c and src/WG*.java files.

ParaGrapher

Related Posts

On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets – BigData’23 (Short Paper)

2023 IEEE International Conference on Big Data (BigData’23)
December 15-18, 2023, Sorrento, Italia

DOI: 10.1109/BigData59044.2023.10386309
PDF (Authors Copy)

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly-accessible, relevant, and realistic data sets.

To ensure continuation of this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family of publicly available real-world edge-weighted graph datasets with up to 2.5 trillion edges, that is, 6.6 times greater than the largest graph published recently. The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family includes also seven subgraphs with different sizes and direction types.

We describe two main challenges we faced in generating large graph datasets and our solutions, that are, (i) optimizing data structures and algorithms for this multi-step process and (ii) WebGraph parallel compression technique.

The datasets are available online on https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs.

BibTex

@INPROCEEDINGS{10.1109/BigData59044.2023.10386309,
   author = {Koohi Esfahani, Mohsen and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick,  Peter and  Vigna, Sebastiano},  
  booktitle={2023 IEEE International Conference on Big Data (BigData'23)},  
  title={On Overcoming {HPC} Challenges of  Trillion-Scale Real-World Graph Datasets}, 
  year={2023},
  volume={},
  number={},
  pages={},
  location={Italia, Sorrento},
  publisher={IEEE Computer Society},
  doi={10.1109/BigData59044.2023.10386309}
}

MS-BioGraphs

Related Posts

Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster)

2023 IEEE International Symposium on Workload Characterization (IISWC’23)
October 1-3, 2023, Ghent, Belgium

DOI: 10.1109/IISWC59245.2023.00029
PDF Version

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly-accessible, relevant, and realistic data sets.

In this paper, we announce publication of MS-BioGraphs, a new family of publicly-available real-world edge-weighted graph datasets with up to 2.5 trillion edges, that is, 6.6 times greater than the largest graph published recently.

We briefly review the two main challenges we faced in generating large graph datasets and our solutions, that are, (i) optimizing data structures and algorithms for this multi-step process and (ii) WebGraph parallel compression technique. We also study some characteristics of MS-BioGraphs.

The datasets are available on https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs .

Please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Sequence-Similarity-Graph-Datasets/ for a complete version of this paper.

Bibtex

@INPROCEEDINGS{10.1109/IISWC59245.2023.00029,
   author = {Koohi Esfahani, Mohsen and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick,  Peter and  Vigna, Sebastiano},  
  booktitle={2023 IEEE International Symposium on Workload Characterization (IISWC'23)},  
  title={Dataset Announcement: {MS-BioGraphs}, Trillion-Scale Public Real-World Sequence Similarity Graphs}, 
  year={2023},
  volume={},
  number={},
  pages={},
  location={Belgium, Ghent},
  publisher={IEEE Computer Society},
  doi={10.1109/IISWC59245.2023.00029}
}

MS-BioGraphs

Related Posts

MS-BioGraphs: Sequence Similarity Graph Datasets

DOI: 10.48550/arXiv.2308.16744

PDF Version
arXiv Link

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly-accessible, relevant, and realistic data sets.

To ensure continuation of this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family of publicly available real-world edge-weighted graph datasets with up to 2.5 trillion edges, that is, 6.6 times greater than the largest graph published recently. The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family includes also seven subgraphs with different sizes and direction types.

We describe two main challenges we faced in generating large graph datasets and our solutions, that are, (i) optimizing data structures and algorithms for this multi-step process and (ii) WebGraph parallel compression technique. We present a comparative study of structural characteristics of MS-BioGraphs.

The datasets are available online on https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs .

BibTex

@article{MS-BioGraphs-arxiv,
    title = {{MS-BioGraphs}: Sequence Similarity Graph Datasets},
    author = {Koohi Esfahani, Mohsen and Boldi, Paolo and Vandierendonck, Hans and Kilpatrick, Peter and Vigna, Sebastiano},
    year = 2023,
    journal = {CoRR},
    volume = {abs/2308.16744},
    doi = {10.48550/arXiv.2308.16744},
    url = {https://doi.org/10.48550/arXiv.2308.16744},
    archiveprefix = {arXiv},
    eprint = {2308.16744}
}

MS-BioGraphs

Related Posts