Random Vertex Relabelling in LaganLighter

Posted on 21 August 2024 by Mohsen Koohi Esfahani

To evaluate the impacts of locality-optimizing reordering algorithms, a baseline is required. To create the baseline a random assignment of IDs to vertices may be used to produce a representation of the graph with reduced locality [ DOI:10.1109/ISPASS57527.2023.00029, DOI:10.1109/IISWC53511.2021.00020 ].

To that end, we create the random_ordering() function in relabel.c file. It consists a number of iterations. In each iteration, concurrent threads traverse the list of vertices and assign them new IDs. The function uses xoshiro to produce random numbers.

The alg4_randomize tests this function for a number of graphs. For each dataset, an initial plot of degree distribution of Neighbor to Neighbor Average ID Distance (N2N AID) [DOI:10.1109/IISWC53511.2021.00020] is created. Also, after each iteration of random_ordering() the N2N AID distribution is plotted. This shows the impacts of randomization.

The complete results for all graphs can be seen in this PDF file. The results for some graphs are in the following.

The algorithm has been executed on a machine with two AMD 7401 CPUs, 128 cores, 128 threads. The report created by the launcher is in the following.

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

LaganLighter

On Optimizing Locality of Graph Transposition on Modern Architectures15 January 2025
Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
ParaGrapher Integrated to LaganLighter16 February 2024
On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)8 December 2022
LaganLighter Source Code14 November 2022
MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’2228 June 2022
SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster)23 May 2022
LOTUS: Locality Optimizing Triangle Counting – PPOPP’225 April 2022
Locality Analysis of Graph Reordering Algorithms – IISWC’218 November 2021
Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’219 September 2021
Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’219 August 2021
How Do Graph Relabeling Algorithms Improve Memory Locality? ISPASS’21 (Poster)28 March 2021

Minimum Spanning Forest of MS-BioGraphs

Posted on 9 August 2024 by Mohsen Koohi Esfahani

We use MASTIFF to compute the weight of Minimum Spanning Forest (MST) of MS-BioGraphs while ignoring self-edges of the graphs.

– MS1

Using machine with 24 cores.

MSF weight: 109,915,787,546

– MS50

Using machine with 128 cores.

MSF weight: 416,318,200,808

MS-BioGraphs
Related Posts

Minimum Spanning Forest of MS-BioGraphs9 August 2024
MS-BioGraphs on IEEE DataPort17 April 2024
ParaGrapher Source Code For WebGraph Types16 February 2024
On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets – BigData’23 (Short Paper)15 December 2023
Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs – IISWC’23 (Poster)2 October 2023
MS-BioGraphs: Sequence Similarity Graph Datasets30 August 2023
MS-BioGraphs MS10 August 2023
MS-BioGraphs MSA50010 August 2023
MS-BioGraphs MS20010 August 2023
MS-BioGraphs MSA20010 August 2023
MS-BioGraphs MS5010 August 2023
MS-BioGraphs MSA5010 August 2023
MS-BioGraphs MSA1010 August 2023
MS-BioGraphs MS110 August 2023
MS-BioGraphs Validation10 August 2023

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

LaganLighter

On Optimizing Locality of Graph Transposition on Modern Architectures15 January 2025
Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
ParaGrapher Integrated to LaganLighter16 February 2024
On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)8 December 2022
LaganLighter Source Code14 November 2022
MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’2228 June 2022
SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster)23 May 2022
LOTUS: Locality Optimizing Triangle Counting – PPOPP’225 April 2022
Locality Analysis of Graph Reordering Algorithms – IISWC’218 November 2021
Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’219 September 2021
Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’219 August 2021
How Do Graph Relabeling Algorithms Improve Memory Locality? ISPASS’21 (Poster)28 March 2021

Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP

Posted on 3 August 2024 by Mohsen Koohi Esfahani

In applications such as graph processing, it is important how threads are pinned on CPU cores as the threads that share resources (such as memory and cache) can accelerate the performance by processing consecutive blocks of input dataset, especially, when the dataset has a high-level of locality.

In LaganLighter, we read the CPU topology to specify how OpenMP threads are pinned. In omp.c file, the block starting with comment “Reading sibling groups of each node“, reads the “/sys/devices/system/cpu/cpu*/topology/thread_siblings” files to identify the sibling threads and three arrays ("node_sibling_groups_start_ID“, “sibling_group_cpus_start_offsets“, and “sibling_groups_cpus“) are used to store the sibling CPUs.

Then, in block starting with comment “Setting affinity of threads“, the sibling groups are read and based on the total number of threads requested by user, a number of threads with consecutive IDs are pinned to sibling CPUs.

For a machine with 24 cores, 48 hyperthreads, when 48 threads are requested, we have:

If 96 threads are created, we have:

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

LaganLighter

On Optimizing Locality of Graph Transposition on Modern Architectures15 January 2025
Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
ParaGrapher Integrated to LaganLighter16 February 2024
On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)8 December 2022
LaganLighter Source Code14 November 2022
MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’2228 June 2022
SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster)23 May 2022
LOTUS: Locality Optimizing Triangle Counting – PPOPP’225 April 2022
Locality Analysis of Graph Reordering Algorithms – IISWC’218 November 2021
Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’219 September 2021
Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’219 August 2021
How Do Graph Relabeling Algorithms Improve Memory Locality? ISPASS’21 (Poster)28 March 2021

An (Incomplete) List of Publicly Available Graph Datasets/Generators

Posted on 21 June 2024 by Mohsen Koohi Esfahani

Short URL of this post: https://blogs.qub.ac.uk/DIPSA/graphs-list-2024

Real-World Graphs

Smaller Graphs

Synthetic Graph Generators

Graph500, [OSTI ID: 1014641], https://github.com/graph500/graph500/tree/newreference/generator
KaGen, [DOI: 10.1109/IPDPS.2018.00043], https://github.com/KarlsruheGraphGeneration/KaGen
GTgraph, https://github.com/Bader-Research/GTgraph
Smooth Kronocker, [DOI: 10.1145/3398682.3399161], https://github.com/dmargo/smooth_kron_gen

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)

Posted on 20 April 2024 by Mohsen Koohi Esfahani

Short URL of this post: https://blogs.qub.ac.uk/DIPSA/HDD-vs-SSD-vs-LustreFS-2024

We evaluate read bandwidth of three storage types:

HDD: A 6TB Hitachi HUS726060AL 7200RPM SATA v3.1
SSD: A 4TB Samsung MZQL23T8HCLS-00A07 PCIe4 NVMe v1.4
LustreFS: A parallel file system with total 2PB with a SSD pool

and for three parallel read methods:

mmap: https://man7.org/linux/man-pages/man2/mmap.2.html
pread: https://man7.org/linux/man-pages/man2/pread.2.html
read: https://man7.org/linux/man-pages/man2/read.2.html

and for two block sizes:

4 KB blocks
4 MB blocks

The source code is available on ParaGrapher repository:

https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/read_bandwidth.c
https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/read_bandwidth_mam.c : this file is similar to previous one, but repeats each evaluation for a user-defined number of rounds and identifies Min, Average, and Max. values.

The OS cache of storage contents have been dropped after each evaluation
(sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches').
The flushcache.c file (https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/flushcache.c) can be used with the same functionality for users without sudo access, however, it usually takes more time to be finished.

For LustreFS, we have repeated the evaluation of read and pread using O_DIRECT flag as this flag prevents client-side caching.

For HDD and SSD experiments, we have used a machine with Intel W-2295 3.00GHz CPU, 18 cores, 36 hyper-threads, 24MB L3 cache, 256 GB DDR4 2933Mhz memory, running Debian 12 Linux 6.1. For LustreFS, we have used a machine with 2TB 3.2GHz DDR4 memory, 2 AMD 7702 CPUs, in total, 128 cores, 256 threads.

The results of the evaluation using read_bandwidth.c are in the following table. The values are Bandwidth in MB/s. Also, 1-2 digits close to each number with a white background are are percentage of load imbalance between parallel threads.

Please click on the image to expand.

C vs. Java

We measure the bandwidth of SSD and HDD in C (mmap and pread) vs. Java (mmap and read). We use a machine with Intel W-2295 3.00GHz CPU, 18 cores, 36 hyper-threads, 24MB L3 cache, 256 GB DDR4 2933Mhz memory, running Debian 12 Linux 6.1 and the following codes:

https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/read_bandwidth.c
https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/ReadBandwidth.java : this is a Java-based evaluation of read bandwidth and the script (https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/java-read-bandwidth.sh) can be used to create changes in evaluation parameters.

The results are in the following.

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

ParaGrapher

Selective Parallel Loading of Large-Scale Compressed Graphs with ParaGrapher – arXiv Version1 May 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
ParaGrapher Integrated to LaganLighter16 February 2024
ParaGrapher Source Code For WebGraph Types16 February 2024

SIMD Bit Twiddling Hacks

Posted on 25 November 2023 by Hans Vandierendonck

The Bit Twiddling Hacks website collects an array of useful code fragments that implement some very specific computations very efficiently. Here we collect references to some handy code fragments for SIMD based computation.

AVX2 Population Count: Mula’s algorithm
AVX2 Count Leading Zeros for 8-bit integers
AVX2 Count Leading Zeroes for 32-bit integers
AVX512 Alternatives faster than VP2INTERSECT
AVX2 Setting single bit in SIMD register

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

LaganLighter Source Code

Posted on 14 November 2022 by Mohsen Koohi Esfahani

Repository

https://github.com/DIPSA-QUB/LaganLighter

Documentation

https://github.com/DIPS-QUB/LaganLighter/tree/main/docs

Algorithms in This Repo

SAPCo Sort: alg1_sapco_sort
Thrifty Label Propagation Connected Components: alg2_thrifty
MASTIFF: Structure-Aware Mimum Spanning Tree/Forest: alg3_mastiff
iHTL: in-Hub Temporal Locality in SpMV (Sparse-Matrix Vector Multiplication) based Graph Processing: (to be added)
LOTUS: Locality Optimizing Trinagle Counting: (to be added)

Cloning

git clone https://github.com/MohsenKoohi/LaganLighter.git --recursive

Graph Types

LaganLighter supports the following graph formats:

CSR/CSC graph in text format, for testing. This format has 4 lines: (i) number of vertices (|V|), (ii) number of edges (|E|), (iii) |V| space-separated numbers showing offsets of the vertices, and (iv) |E| space-separated numbers indicating edges.
CSR/CSC WebGraph format: supported by the Poplar Graph Loading Library
external git repository

Measurements

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions. ( papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it with the total time (mt).

mt = - get_nano_time()
#pragma omp parallel  
{
   unsigned tid = omp_get_thread_num();
   ttimes[tid] = - get_nano_time();
	
   #pragma omp for nowait
   for(unsigned int v = 0; v < g->vertices_count; v++)
   {
      // .....
   }
   ttimes[tid] += get_nano_time();
}
mt += get_nano_time();
PTIP("Step ... ");

As an example, the following execution of Thrifty, shows that the “Zero Planting” step has been performed in 8.98 milliseconds and with a 8.22% load imbalance, while processors have been idle for 72.22% of the execution time, on average, in the “Initial Push” step.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

In order to assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially divide partitions and assign a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process. During the initialization of LaganLighter parallel processing environment (in initialize_omp_par_env() function defined in file omp.c), for each thread, we create a list of threads as consequent victims of stealing.

A thread, first, steals jobs (i.e., partitions) from consequent threads in the same NUMA node and then from the threads in consequent NUMA nodes. As an example, the following image shows the stealing order of a 24-core machine with 2 NUMA nodes. This shows that thread 1 steals from threads 2, 3, …,11, and ,0 running on the same NUMA socket and then from threads 13, 14, …, 23, and 12 running on the next NUMA socket.

We use dynamic_partitioning_...() functions (in file partitioning.c) to process partitions by threads in the specified order. A sample code is in the following:

struct dynamic_partitioning* dp = dynamic_partitioning_initialize(pe, partitions_count);

#pragma omp parallel  
{
   unsigned int tid = omp_get_thread_num();
   unsigned int partition = -1U;		

   while(1)
   {
      partition = dynamic_partitioning_get_next_partition(dp, tid, partition);
      if(partition == -1U)
	 break; 

      for(v = start_vertex[partition]; v < start_vertex[partition + 1]; v++)
      {
	// ....
       }
   }
}

dynamic_partitioning_reset(dp);

Bugs & Support

As “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you receive wrong results or you are suspicious about parts of the code, please contact us or submit an issue.

License

Licensed under the GNU v3 General Public License, as published by the Free Software Foundation. You must not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed upon in writing, this Software is distributed on an “as is” basis, without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose, neither express nor implied. For details see terms of the License.

Copyright 2022 The Queen’s University of Belfast, Northern Ireland, UK

LaganLighter

On Optimizing Locality of Graph Transposition on Modern Architectures15 January 2025
Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
ParaGrapher Integrated to LaganLighter16 February 2024
On Designing Structure-Aware High-Performance Graph Algorithms (PhD Thesis)8 December 2022
LaganLighter Source Code14 November 2022
MASTIFF: Structure-Aware Minimum Spanning Tree/Forest – ICS’2228 June 2022
SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs – ISPASS’22 (Poster)23 May 2022
LOTUS: Locality Optimizing Triangle Counting – PPOPP’225 April 2022
Locality Analysis of Graph Reordering Algorithms – IISWC’218 November 2021
Thrifty Label Propagation: Fast Connected Components for Skewed-Degree Graphs – IEEE CLUSTER’219 September 2021
Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing – ICPP’219 August 2021
How Do Graph Relabeling Algorithms Improve Memory Locality? ISPASS’21 (Poster)28 March 2021

Technical Posts

Random Vertex Relabelling in LaganLighter21 August 2024
Minimum Spanning Forest of MS-BioGraphs9 August 2024
Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP3 August 2024
An (Incomplete) List of Publicly Available Graph Datasets/Generators21 June 2024
An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)20 April 2024
SIMD Bit Twiddling Hacks25 November 2023
LaganLighter Source Code14 November 2022

DIPSA: Data-Intensive Parallel Systems and Algorithms

Category Archives: Technical Posts

Random Vertex Relabelling in LaganLighter

Minimum Spanning Forest of MS-BioGraphs

– MS1

– MS50

Topology-Based Thread Affinity Setting (Thread Pinning) in OpenMP

An (Incomplete) List of Publicly Available Graph Datasets/Generators

An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)

C vs. Java

SIMD Bit Twiddling Hacks

LaganLighter Source Code

Repository

Algorithms in This Repo

Cloning

Graph Types

Measurements

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

Bugs & Support

License

Related Posts