MS-BioGraphs Validation

Repository

https://github.com/DIPSA-QUB/MS-BioGraphs-Validation

Explanation

We provide a Shell script, validation.sh, and a Java program, EdgeBlockSHA.java, to verify the the correctness of the graphs. Each graph has a .ojson file whose shasum is verified by the value retreived from our server. Files such as offsets.bin, wcc.bin, n2o.bin, trans_offsets.bin, and edges_shas.txt have shasum records in the ojson file which is used for validation of these files.

The graph in WebGraph format has been compressed in MS??-underlying.* and MS??-weights.* files. In order to validate the compressed graph, the EdgeBlockSHA.java is used. It is a parallel Java code that uses the WebGraph library to traverse the graph and calculate the shasum of blocks of edges (endpoints and weights). Then, the calculated results are matched with the edges_shas.txt file of the graph.

It is also possible to validate some particular blocks by matching the calculated shasum with the relevant row in the edges_shas.txt file. This file has a format such as the following. Each block contains 64 Million consecutive edges. The start of each block is identified by a vertex ID and its edge index. The Column endpoint_sha is the shasum of the 64 Million endpoints when stored as an array of 4-Bytes elements in the binary format and in the little endian order. Similarly, Column weights_sha shows the shasum of weights (labels). We have separated weights from endpoints as in some applications weights are not needed and therefore it is not necessary to read and validate them.

64MB blk#;     vertex; edge index;                             endpoint_sha;                              weights_sha;
         0;          0;          0; 509784b158cb9404241afb21d0ceaf590b88d2f2; 57da4ad7bb89c5922e436b0535d791fa8f40dffd;
         1;    2315113;        705; fafc118563c1d7b5fbff64af56edd6a56524f479; 13b7a9ca60bfb0715d563218d0a1cd787b00a07c;
         2;    4521625;        597; 4ed65aa07c8062a151166ef2e9bdb93e41d19357; 8158276bec426ee46eca9912759eb9bd57fcc957;
         3;    6347361;        112; d02e8913c807c3f4ecde9c638e0ded5ab80ba819; 26bc3296de65cba6ac539cd96b79ae6f7a4d37be;
         4;    8447869;         15; 61513c84db40124496cdf769516118b63598914f; 781b9f4372ac614e94d097017c756d015234deb6; 
 

Requirements

  • JDK with version > 15
  • jq
  • wget

WebGraph Framework

Please visit https://webgraph.di.unimi.it .

ParaGrapher Graph Loading API and Library

The WebGraph formats can also be read using the ParaGrapher library: https://blogs.qub.ac.uk/DIPSA/ParaGrapher/.

License

Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied.

Copyright 2022-2023 The Queen’s University of Belfast, Northern Ireland, UK

MS-BioGraphs

Related Posts

Poster Presentation | Graphlet-based Filtering for High-Performance Subgraph Isomorphism Search

I presented a poster on “Graphlet-based Filtering for High-Performance Subgraph Isomorphism Search” at the 19th International Summer School on Advanced Computer Architecture and Compilation for High-performance Embedded Systems organised by HiPEAC at Fiuggi, Italy between July 9 – 15, 2023.


Summary

Introduction

Graphs are powerful tools for representing complex relationships in various domains such as social networks, bioinformatics, and computer networks. The Subgraph Isomorphism (SI) search problem is crucial in graph analytics, involving finding instances of a pattern graph within a larger data graph. This problem has applications in pattern discovery and graph database queries.

However, the SI problem is NP-complete, meaning that it becomes computationally intensive for larger graphs. To tackle this challenge, researchers have proposed heuristic algorithms that aim to speed up the SI search by pruning the search space through various techniques.

Three Stages of Heuristic Algorithms

These heuristic algorithms can be divided into three stages:

  1. Filtering: In this stage, a candidate set of data graph vertices is generated for each vertex of the pattern graph. The goal is to select a subset of data graph vertices that are potential matches for each pattern vertex. Two common filtering techniques are Label and Degree Filtering (LDF) and Neighbourhood Label Filtering (NLF). LDF ensures that only vertices with matching labels and sufficient degrees are considered, while NLF adds further constraints based on the labels of neighbours.
  2. Ordering: The vertices of the pattern graph are ordered for mapping with the candidate data vertices. The Highest Degree First (HDF) strategy is commonly used.
  3. Backtracking: A backtracking search is performed based on the ordered vertices to find subgraph isomorphisms.

Proposed Approach: Graphlet Filtering (GLF)

The proposed approach, Graphlet Filtering (GLF), introduces an additional stage to the heuristic algorithms. Graphlets are recurring small subgraphs, and orbits represent groups of vertices within these graphlets. By counting the occurrences of orbits in both pattern and data graphs, we propose that the count of an orbit in the pattern graph must be less than or equal to the count of the same orbit in the candidate data vertices’ graphs. This idea enhances the filtering stage by providing more stringent filtering criteria.

Preliminary Results

We conducted preliminary experiments on various datasets and found that GLF filtering provides additional filtering enhancements ranging from 4.2% to 15.38% depending on the dataset and pattern graph density. Across all datasets, the GLF technique improved filtering by 9.93% for sparse pattern graphs and 8.49% for dense pattern graphs.

Conclusion

The study suggests that incorporating graphlet-based filtering (GLF) into the existing heuristic algorithms for the Subgraph Isomorphism problem can lead to more effective filtering and pruning of the search space. We plan to explore the impact of GLF on the runtime of different heuristic algorithms. If successful, this technique could significantly reduce the execution time of algorithms used for tasks such as pattern discovery and graph database queries.


Download the poster and the abstract below.

ASCCED: Asynchronous Scientific Continuous Computations Exploiting Disaggregation

UKRI EPSRC Grant

The design of efficient and scalable scientific simulation software is reaching a critical point whereby continued advances are increasingly harder, more labour-intensive, and thus more expensive to achieve. This challenge emanates from the constantly evolving design of large-scale high-performance computing systems. World-leading (pre-)exascale systems, as well as their successors, are characterised by multi-million-scale parallel computing activities and a highly heterogeneous mix of processor types such as high-end many-core processors, Graphics Processing Units (GPU), machine learning accelerators, and various accelerators for compression, encryption and in-network processing. To make efficient use of these systems, scientific simulation software must be decomposed in various independent components and make simultaneous use of the variety of heterogeneous compute units.

Developing efficient, scalable scientific simulation software for these systems becomes increasingly harder as the limits of parallelism available in the simulation codes is approached. Moreover, the limit of parallelism cannot be reached in practice due to heterogeneity, system imbalances and synchronisation overheads. Scientific simulation software often persists over several decades. The software is optimised and re-optimised repeatedly as the design and scale of the target hardware evolves at a much faster pace, as impactful changes in the hardware may occur every few years. One may thus find that the guiding principles that underpin such software are outdated.

The ASCCED project will fundamentally change the status quo in the design of scientific simulation software by simplifying the design to reduce software development and maintenance effort, to facilitate performance optimisation, and to make software more robust to future evolution of computing hardware. The key distinguishing factor of our approach is to structure scientific simulation software as a collection of loosely coupled parallel activities. We will explore the opportunities and challenges of applying techniques previously developed for Parallel Discrete Event Simulation (PDES) to orchestrate these loosely coupled parallel activities. This radically novel approach will enable runtime system software to extract unprecedented scales of parallelism and to minimise performance inefficiencies due to synchronisation. Additionally, based on a speculative execution mechanism, it will uncover parallelism that has not been feasible to extract before.

The computational model proposed by ASCCED will, if successful, initiate a new direction of research within programming models for high-performance computing that may dramatically impact not only the performance of scientific simulation software, but can also reduce the engineering effort required to produce efficient scientific simulation software. It will have a profound impact on the sciences that are highly dependent on leadership computing capabilities, such as climate modeling and cancer research.

On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)

Mohsen Koohi Esfahani
Supervisors: Hans Vandierendonck and Peter Kilpatrick

Thesis in PDF format
Thesis on QUB Pure Portal

Graph algorithms find several usages in industry, science, humanities, and technology. The fast-growing size of graph datasets in the context of the processing model of the current hardware has resulted in different bottlenecks such as memory locality, work-efficiency, and load-balance that degrade the performance. To tackle these limitations, high-performance computing considers different aspects of the execution in order to design optimized algorithms through efficient usage of hardware resources.

The main idea in this thesis is to analyze the structure of graphs to exploit special features that are key to introduce new graph algorithms with optimized performance.

First, we study the structure of real-world graph datasets with skewed degree distribution and the applicability of graph relabeling algorithms as the main restructuring tools to improve performance and memory locality. To that end, we introduce novel locality metrics including Cache Miss Rate Degree Distribution, Effective Cache Size, Push Locality and Pull Locality, and Degree Range Decomposition.

Based on this structural analysis, we introduce the Uniform Memory Demands strategy that (i) recognizes diverse memory demands and behaviours as a source of performance inefficiency, (ii) separates contrasting memory demands into groups with uniform behaviours across each group, and (iii) designs bespoke data structures and algorithms for each group in order to satisfy memory demands with the lowest overhead.

We apply the Uniform Memory Demands strategy to design three graph algorithms with optimized performance: (i) the SAPCo Sort algorithm as a parallel counting sort algorithm that is faster than comparison-based sorting algorithms in degree-ordering of power-law graphs, (ii) the iHTL algorithm that optimizes locality in Sparse Matrix-Vector (SpMV) Multiplication graph algorithms by extracting dense subgraphs containing incoming edges to in-hubs and processing them in the push direction, and (iii) the LOTUS algorithm that optimizes locality in Triangle Counting by separating different caching demands and deploying specific data structure and algorithm for each of them.

Bibtex

@phdthesis{ODSAGA-ethos.874822,
  title  = {On Designing Structure-Aware High-Performance Graph Algorithms},
  author = {Mohsen Koohi Esfahani},
  year   = 2022,
  url    = {https://blogs.qub.ac.uk/DIPSA/On-Designing-Structure-Aware-High-Performance-Graph-Algorithms-PhD-Thesis/},
  school = {Queen's University Belfast},
  EThOSID = {uk.bl.ethos.874822}
}

Related Posts

LaganLighter

LaganLighter Source Code


Repository

https://github.com/DIPSA-QUB/LaganLighter

Documentation

https://github.com/DIPS-QUB/LaganLighter/tree/main/docs

Algorithms in This Repo

Cloning

git clone https://github.com/MohsenKoohi/LaganLighter.git --recursive

Graph Types

LaganLighter supports the following graph formats:

  • CSR/CSC graph in text format, for testing. This format has 4 lines: (i) number of vertices (|V|), (ii) number of edges (|E|), (iii) |V| space-separated numbers showing offsets of the vertices, and (iv) |E| space-separated numbers indicating edges.
  • CSR/CSC WebGraph format: supported by the Poplar Graph Loading Library
    external git repository

Measurements

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions. ( papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it with the total time (mt).

mt = - get_nano_time()
#pragma omp parallel  
{
   unsigned tid = omp_get_thread_num();
   ttimes[tid] = - get_nano_time();
	
   #pragma omp for nowait
   for(unsigned int v = 0; v < g->vertices_count; v++)
   {
      // .....
   }
   ttimes[tid] += get_nano_time();
}
mt += get_nano_time();
PTIP("Step ... ");

As an example, the following execution of Thrifty, shows that the “Zero Planting” step has been performed in 8.98 milliseconds and with a 8.22% load imbalance, while processors have been idle for 72.22% of the execution time, on average, in the “Initial Push” step.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

In order to assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially divide partitions and assign a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process. During the initialization of LaganLighter parallel processing environment (in initialize_omp_par_env() function defined in file omp.c), for each thread, we create a list of threads as consequent victims of stealing.

A thread, first, steals jobs (i.e., partitions) from consequent threads in the same NUMA node and then from the threads in consequent NUMA nodes. As an example, the following image shows the stealing order of a 24-core machine with 2 NUMA nodes. This shows that thread 1 steals from threads 2, 3, …,11, and ,0 running on the same NUMA socket and then from threads 13, 14, …, 23, and 12 running on the next NUMA socket.

We use dynamic_partitioning_...() functions (in file partitioning.c) to process partitions by threads in the specified order. A sample code is in the following:

struct dynamic_partitioning* dp = dynamic_partitioning_initialize(pe, partitions_count);

#pragma omp parallel  
{
   unsigned int tid = omp_get_thread_num();
   unsigned int partition = -1U;		

   while(1)
   {
      partition = dynamic_partitioning_get_next_partition(dp, tid, partition);
      if(partition == -1U)
	 break; 

      for(v = start_vertex[partition]; v < start_vertex[partition + 1]; v++)
      {
	// ....
       }
   }
}

dynamic_partitioning_reset(dp);

Bugs & Support

As “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you receive wrong results or you are suspicious about parts of the code, please contact us or submit an issue.

License

Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied. For details see terms of the License.

Copyright 2022 The Queen’s University of Belfast, Northern Ireland, UK

LaganLighter

Related Posts

Technical Posts

Approximate Maximum Weighted Clique

This project aims to develop novel algorithms for the maximum weighted clique (MWC) problem, which appears in various data analysis pipelines in precision medicine. The MWC problem is NP-hard in nature, which makes it particularly challenging given the exponentially increasing amount of data it is applied to.

Although several attempts have been made to solve the maximum weighted clique problem in large graphs, there is still much opportunity for lowering the execution time necessary to find a satisfactory solution. In this project in particular we are investigating approximate algorithms for the MWC problem. We are working towards an algorithm that achieves a very high quality solution (i.e., finding a clique with weight very close to the MWC) in polynomial time.

IBM will provide industrially relevant context on knowledge extraction from graph-structured data. They have extensive experience in this area by building scalable software systems for the analysis of massive-scale graph data. They will moreover provide access to relevant datasets.

Project Members

Publications

  • QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique, EURO-PAR 2024, link.

Funding

This PhD project is funded by the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie Actions.

MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’22

36th ACM International Conference on Supercomputing 2022
June 27-30, 2022
Acceptance Rate: 25%

DOI: 10.1145/3524059.3532365
Authors’ Copy (PDF Format)

The Minimum Spanning Forest (MSF) problem finds usage in many different applications. While theoretical analysis shows that linear-time solutions exist, in practice, parallel MSF algorithms remain computationally demanding due to the continuously increasing size of data sets.

In this paper, we study the MSF algorithm from the perspective of graph structure and investigate the implications of the power-law degree distribution of real-world graphs
on this algorithm.

We introduce the MASTIFF algorithm as a structure-aware MSF algorithm that optimizes work efficiency by (1) dynamically tracking the largest forest component of each graph component and exempting them from processing, and (2) by avoiding topology-related operations such as relabeling and merging neighbour lists.

The evaluations on 2 different processor architectures with up to 128 cores and on graphs of up to 124 billion edges, shows that Mastiff is 3.4–5.9× faster than previous works.

Code Availability
The source-code of MASTIFF is available on LaganLighter Repository (alg3_mastiff.c and msf.c files). A sample execution of this source code for “Twitter-MPI” graph is shown in the following:

BibTex

@INPROCEEDINGS{10.1145/3524059.3532365,
author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
title = {{MASTIFF}: Structure-Aware Minimum Spanning Tree/Forest},
year = {2022},
isbn = {},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3524059.3532365},
doi = {10.1145/3524059.3532365},
booktitle = {Proceedings of the 36th ACM International Conference on Supercomputing},
numpages = {13}
}

LaganLighter

Related Posts

Software-Defined Floating-Point Number Formats and Their Application to Graph Processing – ICS’22

DOI: 10.1145/3524059.3532360

This paper proposes software-defined floating-point number formats for graph processing workloads, which can improve performance in irregular workloads by reducing cache misses. Efficient arithmetic on software-defined number formats is challenging, even when based on conversion to wider, hardware-supported formats. We derive efficient conversion schemes that are tuned to the IA64 and AVX512 instruction sets.

We demonstrate that: (i) reduced-precision number formats can be applied to graph processing without loss of accuracy; (ii) conversion of floating-point values is possible
with minimal instructions; (iii) conversions are most efficient when utilizing vectorized instruction sets, specifically on IA64 processors.

Experiments on twelve real-world graph data sets demonstrate that our techniques result in speedups up to 89% for PageRank and Accelerated PageRank, and up to 35% for Single-Source Shortest Paths. The same techniques help to accelerate the integer-based maximal independent set problem by up to 262%.

SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster)

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2022)
May 22-24, 2022

DOI: 10.1109/ISPASS55109.2022.00015

Authors’ Copy (PDF)

While counting sort has a better complexity than comparison-based sorting algorithms, its parallelization suffers from high performance overhead and/or has a memory complexity that depends on the numbers of threads and elements.

In this paper, we explore the optimization of parallel counting sort for degree-ordering of real-world graphs with skewed degree distribution and we introduce the Structure-Aware Parallel Counting (SAPCo) Sort algorithm that leverages the skewed degree distribution to accelerate sorting.

The evaluation for graphs of up to 3.6 billion vertices shows that SAPCo sort is, on average, 1.7-33.5 times faster than state-of-the-art sorting algorithms such as counting sort, radix sort, and sample sort.

For a detailed explanation of the algorithms, please refer to Chapter 5 of the On Designing Structure-Aware High-Performance Graph Algorithms thesis.

BibTex

@INPROCEEDINGS{10.1109/ISPASS55109.2022.00015,
  author={Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle={2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, 
  title={{SAPCo Sort}: Optimizing Degree-Ordering for Power-Law Graphs}, 
  year={2022},
  volume={},
  number={},
  pages={},
  publisher={IEEE Computer Society},
  doi={10.1109/ISPASS55109.2022.00015}
}

Code Availability
The source-code of SAPCo is available on LaganLighter Repository (alg1_sapco_sort.c and relabel.c files). A sample execution of this source code for “Twitter-MPI” graph is shown in the following:


LaganLighter

Related Posts

LOTUS: Locality Optimizing Triangle Counting – PPOPP’22

27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2022)
April 2-6, 2022
Acceptance Rate: 31%

DOI: 10.1145/3503221.3508402

Authors’ Copy (PDF Format)

Triangle Counting (TC) is a basic graph algorithm and is widely used in different fields of science, humanities and technology. The great size of real-world graphs with skewed degree distribution is a prohibiting factor in efficient TC.

In this paper, we study the implications of the power-law degree distribution of graph datasets on memory utilization in TC and we explain how a great percentage of triangles are formed around a small number of vertices with very great degrees (that are called hubs).

Using these novel observations, we present the LOTUS algorithm as a structure-aware and locality optimizing TC that separates counting triangles formed by hub and non-hub edges. Lotus introduces bespoke TC algorithms and data structures for different edges in order to (i) reduce the size of data structures that are target of random memory accesses, (ii) to optimize cache capacity utilization, (iii) to provide better load balance, and (iv) to avoid fruitless searches.

To provide better CPU utilization, we introduce the Squared Edge Tiling algorithm that divides a large task into smaller ones when the size of work for each edge in the neighbour-list depends on the index of that edge.

We evaluate the Lotus algorithm on 3 different processor architectures and for real-world graph datasets with up to 162 billion edges that shows LOTUS provides 2.2-5.5 times speedup in comparison to the previous works.

  • LOTUS: Locality Optimizing Triangle Counting
  • LOTUS: Locality Optimizing Triangle Counting
  • LOTUS: Locality Optimizing Triangle Counting - Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Analysis of Forward Algorithm
  • LOTUS: Locality Optimizing Triangle Counting - Lotus Graph Structure
  • LOTUS: Locality Optimizing Triangle Counting -
  • LOTUS: Locality Optimizing Triangle Counting - HHH & HHN
  • LOTUS: Locality Optimizing Triangle Counting - HNN
  • LOTUS: Locality Optimizing Triangle Counting - NNN
  • LOTUS: Locality Optimizing Triangle Counting - Evaluation
  • LOTUS: Locality Optimizing Triangle Counting - Conclusion
  • LOTUS: Locality Optimizing Triangle Counting -
  • LOTUS: Locality Optimizing Triangle Counting -

Code Availability
The source-code will be published soon.

BibTex

@INPROCEEDINGS{10.1145/3503221.3508402,
  author = {Koohi Esfahani, Mohsen and Kilpatrick, Peter and Vandierendonck, Hans},
  booktitle = {27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2022)},  
  title = {LOTUS: Locality Optimizing Triangle Counting}, 
  year = {2022},
  numpages = {15},
  pages={219–233},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3503221.3508402}
}

LaganLighter

Related Posts