An Incomplete List of Publicly Available Graph Datasets/Generators

Real-World Graphs

Synthetic Graph Generators

Technical Posts

An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read)

Short URL of this post: https://blogs.qub.ac.uk/DIPSA/HDD-vs-SSD-vs-LustreFS-2024

We evaluate the read bandwidth of three storage types:

  • HDD: A 6TB Hitachi HUS726060AL 7200RPM SATA v3.1
  • SSD: A 4TB Samsung MZQL23T8HCLS-00A07 PCIe4 NVMe v1.4
  • LustreFS: A parallel file system with a total capacity of 2 PB, including an SSD pool

and for three parallel read methods:

  • mmap
  • pread
  • read

and for two block sizes:

  • 4 KB blocks
  • 4 MB blocks

The source code is available in the ParaGrapher repository (https://github.com/DIPSA-QUB/ParaGrapher).
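
As a rough, self-contained illustration of the difference between the evaluated methods (a sketch, not the benchmark code; read_bandwidth.c performs the actual multi-threaded measurement), the following program reads a file in fixed-size blocks once with pread and once through mmap. Error handling is abbreviated.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char** argv)
{
   if(argc != 2)
   {
      printf("Usage: %s file\n", argv[0]);
      return 1;
   }

   int fd = open(argv[1], O_RDONLY);
   struct stat st;
   fstat(fd, &st);

   off_t block_size = 4096;           /* 4 KB blocks; use 4L << 20 for 4 MB blocks */
   char* buf = malloc(block_size);
   unsigned long sum = 0;

   /* Method 1: pread() each block at an explicit offset (no shared file position,
      so multiple threads can read disjoint blocks of the same fd in parallel). */
   for(off_t off = 0; off < st.st_size; off += block_size)
   {
      ssize_t r = pread(fd, buf, block_size, off);
      for(ssize_t i = 0; i < r; i++)
         sum += (unsigned char)buf[i];
   }

   /* Method 2: mmap() the whole file and touch it block by block;
      pages are faulted in by the kernel on first access. */
   char* map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
   for(off_t off = 0; off < st.st_size; off += block_size)
   {
      off_t end = (off + block_size < st.st_size) ? off + block_size : st.st_size;
      for(off_t i = off; i < end; i++)
         sum += (unsigned char)map[i];
   }
   munmap(map, st.st_size);

   free(buf);
   close(fd);
   printf("checksum: %lu\n", sum);
   return 0;
}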

The OS cache of storage contents has been dropped after each evaluation
(sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches').
The flushcache.c file (https://github.com/DIPSA-QUB/ParaGrapher/blob/main/test/flushcache.c) provides the same functionality for users without sudo access; however, it usually takes longer to finish.

For LustreFS, we have repeated the evaluation of read and pread using the O_DIRECT flag, as this flag prevents client-side caching.
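
As a minimal sketch of how O_DIRECT is typically used (an illustration, not the benchmark code), the buffer, file offset, and read size usually have to be aligned to the logical block size:

#define _GNU_SOURCE                    /* exposes O_DIRECT on Linux */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Read one 4 KB block from the beginning of a file, bypassing the page cache. */
ssize_t read_one_direct_block(const char* path)
{
   int fd = open(path, O_RDONLY | O_DIRECT);
   if(fd == -1)
      return -1;

   /* With O_DIRECT, the buffer, offset, and length must usually be aligned
      to the logical block size; 4096 bytes is a common choice. */
   void* buf = NULL;
   if(posix_memalign(&buf, 4096, 4096) != 0)
   {
      close(fd);
      return -1;
   }

   ssize_t r = pread(fd, buf, 4096, 0);

   free(buf);
   close(fd);
   return r;
}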

For the HDD and SSD experiments, we have used a machine with an Intel Xeon W-2295 3.00GHz CPU (18 cores, 36 hyper-threads, 24 MB L3 cache) and 256 GB DDR4-2933 memory, running Debian 12 with Linux 6.1. For LustreFS, we have used a machine with 2 TB DDR4-3200 memory and two AMD EPYC 7702 CPUs (128 cores and 256 threads in total).

The results of the evaluation using read_bandwidth.c are in the following table. The values are read bandwidth in MB/s. The 1-2 digit numbers with a white background next to each value are the percentage of load imbalance between parallel threads.


For similar comparisons you may refer to:
https://github.com/david-slatinek/c-read-vs.-mmap/tree/main
https://eklausmeier.goip.de/blog/2016/02-03-performance-comparison-mmap-versus-read-versus-fread/


SIMD Bit Twiddling Hacks

The Bit Twiddling Hacks website collects an array of useful code fragments that implement some very specific computations very efficiently. Here we collect references to some handy code fragments for SIMD-based computation.
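
As one example in this spirit (our own illustration, not a fragment taken from the Bit Twiddling Hacks page), the classic branchless absolute value hack, (x ^ (x >> 31)) - (x >> 31), translates directly to SSE2 so it applies to four 32-bit integers at once:

#include <emmintrin.h>   /* SSE2 */

/* Branchless absolute value of four packed 32-bit integers. */
void abs4(const int* in, int* out)
{
   __m128i x    = _mm_loadu_si128((const __m128i*)in);
   __m128i sign = _mm_srai_epi32(x, 31);                 /* 0 or -1 per lane */
   __m128i r    = _mm_sub_epi32(_mm_xor_si128(x, sign), sign);
   _mm_storeu_si128((__m128i*)out, r);
}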

LaganLighter Source Code


Repository

https://github.com/DIPSA-QUB/LaganLighter

Algorithms in This Repo

Cloning

git clone https://github.com/MohsenKoohi/LaganLighter.git --recursive

Graph Types

LaganLighter supports the following graph formats:

  • CSR/CSC graph in text format, for testing. This format has 4 lines: (i) the number of vertices (|V|), (ii) the number of edges (|E|), (iii) |V| space-separated numbers giving the offset of each vertex, and (iv) |E| space-separated numbers indicating the edges. A small example follows this list.
  • CSR/CSC WebGraph format: supported through the Poplar Graph Loading Library (external git repository)
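
As an example (assuming the offsets in line (iii) are the starting indices of each vertex's edge list), a graph with 4 vertices and 5 edges, where vertex 0 points to vertices 1 and 2, vertex 1 to vertex 0, vertex 2 to vertex 3, and vertex 3 to vertex 1, is stored as:

4
5
0 2 3 4
1 2 0 3 1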

Measurements

In addition to execution time, we use the PAPI library to measure hardware counters such as L3 cache misses, hardware instructions, DTLB misses, and load and store memory instructions (see the papi_(init/start/reset/stop) and (print/reset)_hw_events functions defined in omp.c).
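
As a minimal sketch of reading one such counter with the standard PAPI API (the papi_* wrappers in omp.c are not reproduced here, and the choice of the PAPI_L3_TCM preset below is an assumption for illustration):

#include <papi.h>

/* Count L3 total cache misses around a call to the measured kernel. */
long long count_l3_misses(void (*kernel)(void))
{
   int es = PAPI_NULL;
   long long value = 0;

   PAPI_library_init(PAPI_VER_CURRENT);
   PAPI_create_eventset(&es);
   PAPI_add_event(es, PAPI_L3_TCM);     /* total L3 cache misses */

   PAPI_start(es);
   kernel();                            /* the code being measured */
   PAPI_stop(es, &value);

   PAPI_cleanup_eventset(es);
   PAPI_destroy_eventset(&es);
   return value;
}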

To measure load balance, we measure the total time of executing a loop and the time each thread spends in this loop (mt and ttimes in the following sample code). Using these values, PTIP macro (defined in omp.c) calculates the percentage of average idle time (as an indicator of load imbalance) and prints it with the total time (mt).

mt = - get_nano_time();             // total time of the parallel loop
#pragma omp parallel
{
   unsigned tid = omp_get_thread_num();
   ttimes[tid] = - get_nano_time(); // per-thread time spent in the loop

   #pragma omp for nowait
   for(unsigned int v = 0; v < g->vertices_count; v++)
   {
      // .....
   }
   ttimes[tid] += get_nano_time();
}
mt += get_nano_time();
PTIP("Step ... ");

As an example, the following execution of Thrifty shows that the “Zero Planting” step has been performed in 8.98 milliseconds with an 8.22% load imbalance, while processors have been idle for 72.22% of the execution time, on average, in the “Initial Push” step.

NUMA-Aware and Locality-Preserving Partitioning and Scheduling

In order to assign consecutive partitions (vertices and/or their edges) to each parallel processor, we initially divide the partitions between threads, assigning a number of consecutive partitions to each thread. Then, we specify the order of victim threads in the work-stealing process. During the initialization of the LaganLighter parallel processing environment (in the initialize_omp_par_env() function defined in omp.c), we create, for each thread, an ordered list of the threads it will visit as victims of stealing.

A thread first steals jobs (i.e., partitions) from the subsequent threads in the same NUMA node and then from the threads in the subsequent NUMA nodes. As an example, the following image shows the stealing order on a 24-core machine with 2 NUMA nodes: thread 1 steals from threads 2, 3, …, 11, and 0 running on the same NUMA socket, and then from threads 13, 14, …, 23, and 12 running on the next NUMA socket.
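
A sketch of one way to derive this ordering (an illustration only; the actual list is built in initialize_omp_par_env(), and the function below is hypothetical):

#include <stdio.h>

/* Print the victim order of a thread: first the other threads on its own
   NUMA node in circular order, then the threads of the following nodes. */
void print_victim_order(unsigned tid, unsigned threads_per_node, unsigned nodes_count)
{
   unsigned node   = tid / threads_per_node;
   unsigned offset = tid % threads_per_node;

   for(unsigned n = 0; n < nodes_count; n++)
   {
      unsigned base  = ((node + n) % nodes_count) * threads_per_node;
      unsigned first = (n == 0) ? 1 : 0;   /* skip the thread itself on its own node */
      for(unsigned t = first; t < threads_per_node; t++)
         printf("%u ", base + (offset + t) % threads_per_node);
   }
   printf("\n");
}

For example, print_victim_order(1, 12, 2) prints 2 3 … 11 0 13 14 … 23 12, matching the order described above.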

We use the dynamic_partitioning_...() functions (defined in partitioning.c) to let threads process partitions in the specified order. A sample code follows:

struct dynamic_partitioning* dp = dynamic_partitioning_initialize(pe, partitions_count);

#pragma omp parallel
{
   unsigned int tid = omp_get_thread_num();
   unsigned int partition = -1U;

   while(1)
   {
      // Get the next partition for this thread, following the stealing order
      partition = dynamic_partitioning_get_next_partition(dp, tid, partition);
      if(partition == -1U)
         break;

      for(unsigned int v = start_vertex[partition]; v < start_vertex[partition + 1]; v++)
      {
         // ....
      }
   }
}

dynamic_partitioning_reset(dp);

Bugs & Support

As “we write bugs that in particular cases have been tested to work correctly”, we try to evaluate and validate the algorithms and their implementations. If you receive wrong results or are suspicious about parts of the code, please contact us or submit an issue.

License

Licensed under the GNU General Public License, version 3, as published by the Free Software Foundation. You may not use this Software except in compliance with the terms of the License. Unless required by applicable law or agreed to in writing, this Software is distributed on an “as is” basis, without warranties of any kind, either express or implied, including, without limitation, the implied warranties of merchantability or fitness for a particular purpose. For details, see the terms of the License.

Copyright 2022 The Queen’s University of Belfast, Northern Ireland, UK
