Locality Analysis of Graph Reordering Algorithms

A major challenge in processing real-world graphs stems from poor locality of memory accesses, and vertex reordering algorithms (RAs) have been proposed to improve locality by changing the order of memory accesses. While state-of-the-art RAs like SlashBurn, GOrder, and Rabbit-Order effectively speed up graph algorithms, their capabilities and disadvantages are not fully understood, mainly for three reasons: (1) the large size of datasets, (2) the lack of suitable measurement tools, and (3) disparate characteristics of graphs. The paucity of analysis has also inhibited the design of more efficient RAs. This paper unlocks this black box by introducing a number of tools, including: (1) a cache simulation technique for processing large graphs, (2) the Neighbour to Neighbour Average ID Distance (N2N AID) as a spatial locality metric, (3) the degree distributions of simulated cache miss rate and AID to investigate how locality of different vertices is affected by RAs, and (4) "effective cache size" to measure how much of cache capacity is used to support random accesses. We introduce (1) asymmetricity degree distribution, (2) degree range decomposition, and (3) push and pull locality to present a structural analysis of different types of real-world graphs by explaining their contrasting behaviours in confronting RAs. Finally, we propose a number of improvements for RAs using the analysis provided in this paper.


I. INTRODUCTION
Among data-intensive problems, graph processing is particularly challenging due to high memory bandwidth requirements. Many real-world graphs, such as those derived from social networks and the world-wide web, show a heavy-tailed or power-law degree distribution, i.e., a very small fraction of the vertices are connected to a disproportionately large fraction of the edges. The large size of these graphs results in seemingly-random memory accesses in graph traversals that cannot be completely satisfied by the processor caches and necessitates improving locality of accesses.
Locality is defined as "the tendency for programs to cluster references to subsets of address space for extended periods" [1], and graph relabeling (also called "reordering" or "renumbering") algorithms try to increase the cache hit rate by changing the order in which vertices are processed, and thus the order in which random memory accesses are made.
A relabeling algorithm (RA) improves locality of a graph traversal by assigning new IDs to vertices in a way that increases the clustering of memory accesses into a range that can be mostly satisfied by cache contents. However, identifying the optimal order that delivers the best locality is an NP-complete problem [2], so RAs [2]-[19] employ heuristics based on assumptions about graph structure or execution environment.
Some studies investigate the impact of RAs on graph analytics [20]-[23] by evaluating the general effect of RAs on the execution time of graph analytics. They do not explain how RAs work or why they affect locality of different graphs in different ways: useful for some, neutral or destructive for others. In order to reach effective and applicable locality optimizing algorithms, there is still a need to understand the strengths and weaknesses of previous efforts.
A key stumbling block to analyzing RAs is the availability of suitable metrics and tools. Numerous metrics are available, but none is fully effective in providing insight into vertex relabeling. Graph topology metrics [24]-[26] summarize the characteristics of graphs independently of execution properties like processing order and vertex ID assignment. As such, they are well suited to analyzing the graph, but do not reflect execution efficiency. Reuse distance curves [27]-[30] are an established means to assess the general degree of locality in programs. In the case of graph processing, reuse distance distributions are determined by the processing order and vertex IDs; however, they do not facilitate analyzing the effectiveness (or shortcomings) of RAs. Moreover, reuse distance curves are practical only for comparing locality of a graph as a whole and do not reveal detailed information about the impact of RAs. The large size of graphs is another source of problems: it makes it highly time-consuming to visualize graphs, to simulate execution, or to apply Monte Carlo-style trial and error methods to find patterns in the execution. What is lacking are light-weight metrics and techniques to analyze locality at a finer scale than the whole graph.
This paper studies three state-of-the-art RAs: SlashBurn [10], GOrder [2], and Rabbit-Order [11]. We identify different locality types in a parallel graph processing environment. Then we introduce a bespoke technique for each RA to explain how it affects locality. We use real execution metrics (such as execution time, last level cache misses and DTLB misses) and the degree distributions of the introduced metrics to compare contrasting effects of RAs on different graphs. To explain these effects, we present a structural analysis of graph datasets.
The contributions of this paper are thus:
• Introducing locality types in a parallel graph traversal,
• Introducing the Neighbour to Neighbour Average ID Distance (N2N AID) as a spatial locality metric,
• Introducing the degree distributions of simulated cache miss rates and AID to study impacts of RAs on vertex classes,
• Introducing degree range decomposition and the degree distribution of asymmetricity to provide structural analysis of different graph types, and
• A characterization of how locality manifests itself differently in a push traversal vs a pull traversal.
Section II describes the background materials and terminology. Section III explains the methodology and the datasets. Section IV presents the RAs studied in this paper and introduces the locality types. We introduce the graph-specific cache simulation technique and the AID metric in Section V. Section VI analyzes the RAs by extending and improving our previous study [31]. Section VII demonstrates a structural analysis of datasets. Finally, we present improvements to RAs and future work in Section VIII.

A. Graph Representation
Graph G = (V, E) has a set of vertices V , and a set of directed edges E. The adjacency matrix is a binary matrix representing the graph: the element at row i and column j is 1 if E contains an edge from vertex i to j, and 0 otherwise.
We use the Compressed Sparse Columns (CSC) and Compressed Sparse Rows (CSR) formats [32] for representing topology data of graphs. CSC and CSR use two arrays: (1) an offsets array containing |V| + 1 elements, and (2) an edges array of |E| elements. The offsets array is indexed by a vertex ID and specifies the index of the first edge of that vertex in the edges array. The edges array stores the source ID of each edge in CSC and the destination ID of each edge in CSR. Each element of the offsets array has a size of 8 bytes and each element of the edges array has a size of 4 bytes.
In addition to topology data, the data of vertices is stored in an array of |V| elements and is indexed by a vertex ID.
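As an illustration of the two formats, the following Python sketch builds the offsets and edges arrays from an edge list. The function name `build_compressed`, its parameters, and the four-vertex example graph are assumptions introduced here for illustration; they are not taken from the implementation used in our experiments.

```python
from typing import List, Tuple


def build_compressed(num_vertices: int,
                     edges: List[Tuple[int, int]],
                     by_destination: bool) -> Tuple[List[int], List[int]]:
    """Return (offsets, neighbours) for a list of (source, destination) edges.

    by_destination=True  gives CSC: edges grouped by destination, storing sources.
    by_destination=False gives CSR: edges grouped by source, storing destinations.
    """
    counts = [0] * num_vertices
    for src, dst in edges:
        counts[dst if by_destination else src] += 1

    # offsets has |V| + 1 entries; offsets[v] is the index of the first edge of v.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]

    neighbours = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each vertex
    for src, dst in edges:
        key, value = (dst, src) if by_destination else (src, dst)
        neighbours[cursor[key]] = value
        cursor[key] += 1
    return offsets, neighbours


# Hypothetical example graph with 4 vertices and 5 directed edges.
example_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
csc_offsets, csc_sources = build_compressed(4, example_edges, by_destination=True)
csr_offsets, csr_dests = build_compressed(4, example_edges, by_destination=False)
```

In this example, vertex 2 has three in-neighbours, so its CSC slice `csc_sources[2:5]` holds the source IDs 0, 1, and 3.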
We use the graph average degree ($|E|/|V|$) as the threshold between low-degree vertices (LDV) and high-degree vertices (HDV). Vertices with degree larger than $\sqrt{|V|}$ are called hubs (borrowed from the huge node definition in GOrder). Hubs are divided into in-hubs and out-hubs. A vertex is an in-hub if its in-degree (the number of vertices that have edges to that vertex) is greater than $\sqrt{|V|}$ and is an out-hub if its out-degree is greater than $\sqrt{|V|}$.
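The vertex classes can be sketched as follows. The function `classify_vertices` and the example degree list are hypothetical. The hub threshold is passed as a parameter; the example call uses the square root of |V|, following the huge node definition of GOrder, which is an assumption of this sketch.

```python
import math
from typing import List


def classify_vertices(degrees: List[int], hub_threshold: float) -> List[str]:
    """Label each vertex as 'LDV', 'HDV', or 'hub' from its degree."""
    average_degree = sum(degrees) / len(degrees)  # |E| / |V| for in- or out-degrees
    labels = []
    for degree in degrees:
        if degree > hub_threshold:
            labels.append("hub")   # hubs are the extreme end of the HDV class
        elif degree > average_degree:
            labels.append("HDV")
        else:
            labels.append("LDV")
    return labels


# Hypothetical in-degrees of a 16-vertex graph: average degree 2, sqrt(|V|) = 4.
in_degrees = [1] * 12 + [2, 3, 5, 10]
labels = classify_vertices(in_degrees, hub_threshold=math.sqrt(len(in_degrees)))
```

A vertex whose degree equals the average is labeled LDV here; how ties at the threshold are treated is a choice of this sketch.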

B. Graph Traversal Model
There are many variations to graph traversal patterns depending on the graph analytics algorithm evaluated. In order to present a single, widely applicable analysis, we focus on a Sparse Matrix-Vector (SpMV) multiplication graph traversal model (Algorithm 1) that traverses all edges of the graph. SpMV underpins several graph analytics like Hyperlink Induced Topic Search [33], Belief Propagation [34], Graph Neural Networks [35], Recurrent Neural Networks [36], PageRank [37], and Community Detection [38], and is the target structure of RAs.
[Algorithm 1: SpMV graph traversal]
SpMV traverses all edges of the graph, which allows it to reveal the maximum improvement provided by RAs. Other graph analytics such as Breadth-First Search (BFS), Connected Components (CC), and Single Source Shortest Path (SSSP) selectively traverse edges as their execution is organized around a frontier (worklist). For instance, the frontier in BFS and SSSP is dependent on the start vertex. So, the memory access pattern (i.e., locality) is unpredictable as it depends on the specifics of the worklist content. Nonetheless, these algorithms have dense phases where all or the majority of the edges are processed (similar to SpMV). As the dense phases dominate the execution time, SpMV is also a suitable representative of these graph analytics.

C. SpMV
Algorithm 1 demonstrates the SpMV in pull direction that we use to investigate RAs in this paper. The outer loop (Lines 1-5) traverses vertices and the inner loop (Lines 3-4) traverses all incoming edges to the vertex. In iteration $i$, the new vertex data ($D_{i+1}$) is calculated using the vertex data ($D_i$) of neighbours.
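The loop structure described above can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not a transcription of Algorithm 1: the function name `spmv_pull`, the plain sum as the per-edge operation, and the small CSC arrays are introduced here for illustration, and the line numbers of Algorithm 1 do not map onto this code.

```python
from typing import List


def spmv_pull(csc_offsets: List[int],
              csc_sources: List[int],
              old_data: List[float]) -> List[float]:
    """One pull iteration: every vertex sums the old data of its in-neighbours."""
    num_vertices = len(csc_offsets) - 1
    new_data = [0.0] * num_vertices
    for v in range(num_vertices):                            # outer loop over vertices
        total = 0.0
        for e in range(csc_offsets[v], csc_offsets[v + 1]):  # sequential topology reads
            u = csc_sources[e]                               # in-neighbour of v
            total += old_data[u]                             # random read of vertex data
        new_data[v] = total
    return new_data


# Hypothetical CSC arrays of a 4-vertex graph with edges
# 0->1, 0->2, 1->2, 2->0, 3->2.
offsets = [0, 1, 2, 5, 5]
sources = [2, 0, 0, 1, 3]
result = spmv_pull(offsets, sources, [1.0, 2.0, 3.0, 4.0])
```

Only the read `old_data[u]` depends on the vertex IDs assigned by a RA; the reads of `csc_offsets` and `csc_sources` are sequential regardless of the ordering.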

D. Random and Sequential Memory Accesses
As a result of the CSC graph representation, memory accesses for reading neighbours of a vertex (Line 3 of Algorithm 1) are performed sequentially and are accelerated by hardware prefetchers. Moreover, each edge in the edges array is accessed only once during a SpMV traversal. So, accesses to each element of the edges array in the topology data are not repeated and cache lines containing these elements show little locality for a number of consecutive accesses.
In Line 4 of Algorithm 1, a memory access is made for reading data of vertex u ($D_i[u]$), which is an in-neighbour of vertex v. So, accessing data of a vertex such as u is repeated in processing each of its out-neighbours. SpMV makes |E| accesses to |V| elements of the data array ($|E| \gg |V|$). On average, each vertex data is read $|E|/|V|$ times; however, accesses to the data of a vertex are dispersed among the |E| accesses. For this reason, these accesses are called random.

E. Graph Reordering
CPU caches try to keep the recently accessed data in cache to accelerate execution by preventing expensive memory accesses; however, the skewed degree distribution of large real-world graphs results in a huge number of random accesses that cannot be fully satisfied by cache (with limited capacity). So, it is necessary to improve locality of random accesses to accelerate the graph traversal.
Random accesses to data of vertices are specified by (1) the order in which vertices are processed and (2) the number of edges each vertex has. RAs keep the second factor unchanged and concentrate on the first factor: they rearrange the relative order between vertices to consequently change the order of edges, i.e., the order of random accesses.
A RA receives a graph as its input and produces a permutation of vertex IDs in the form of a relabeling array of size |V|, which is indexed by the old ID of a vertex and specifies its new ID. Then, the CSC/CSR representations are rebuilt using the relabeling array.
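A minimal sketch of how a relabeling array is produced and applied is given below. The toy ordering by descending in-degree is used only because it is short; it is not one of the RAs studied in this paper, and the function names and the example graph are assumptions made for illustration.

```python
from typing import List, Tuple


def degree_order_relabeling(num_vertices: int,
                            edges: List[Tuple[int, int]]) -> List[int]:
    """Toy RA: return new_id, indexed by old ID, ordering by descending in-degree."""
    in_degree = [0] * num_vertices
    for _, dst in edges:
        in_degree[dst] += 1
    # Old IDs sorted by descending in-degree; ties are broken by old ID.
    order = sorted(range(num_vertices), key=lambda v: (-in_degree[v], v))
    new_id = [0] * num_vertices
    for new, old in enumerate(order):
        new_id[old] = new
    return new_id


def relabel_edges(edges: List[Tuple[int, int]],
                  new_id: List[int]) -> List[Tuple[int, int]]:
    """Rewrite both endpoints of every edge; CSC/CSR are then rebuilt from the result."""
    return [(new_id[src], new_id[dst]) for src, dst in edges]


example_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
new_id = degree_order_relabeling(4, example_edges)
relabeled = relabel_edges(example_edges, new_id)
```

The number of edges and the degree of each vertex are unchanged by the relabeling; only the IDs, and therefore the positions of vertex data in memory, differ.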

F. Traversal Directions
Parallel traversal of edges of a graph is usually performed in push or pull direction.
In pull direction, each vertex reads the old data ($D_i$) of its in-neighbours and writes its new data ($D_{i+1}$). So, random read memory accesses are made to the old data of vertices.
In push direction, each vertex updates the new data of its out-neighbours by its old data. So, random memory accesses are made for writing the new data of vertices. Push direction has an additional cost for protecting the data of vertices from concurrent updates made by parallel processors. We focus this study on the pull direction which is faster.
Based on the adjacency matrix definition (Section II-A), visiting incoming edges in a pull traversal corresponds to a column-major traversal of the adjacency matrix and visiting outgoing edges in a push traversal corresponds to a row-major traversal. So, a pull traversal uses the CSC format and a push traversal uses the CSR format.
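For comparison with the pull direction, a sequential push iteration over CSR can be sketched as below. The function name `spmv_push` and the example arrays are assumptions for illustration. In a parallel execution the update of `new_data[v]` would need protection (e.g., an atomic add), which this single-threaded sketch omits.

```python
from typing import List


def spmv_push(csr_offsets: List[int],
              csr_dests: List[int],
              old_data: List[float]) -> List[float]:
    """One push iteration: every vertex adds its old data to its out-neighbours."""
    num_vertices = len(csr_offsets) - 1
    new_data = [0.0] * num_vertices
    for u in range(num_vertices):                            # row-major traversal
        for e in range(csr_offsets[u], csr_offsets[u + 1]):  # sequential topology reads
            v = csr_dests[e]                                 # out-neighbour of u
            new_data[v] += old_data[u]                       # random write of vertex data
    return new_data


# Hypothetical CSR arrays of a 4-vertex graph with edges
# 0->1, 0->2, 1->2, 2->0, 3->2.
offsets = [0, 2, 3, 4, 5]
dests = [1, 2, 2, 0, 2]
result = spmv_push(offsets, dests, [1.0, 2.0, 3.0, 4.0])
```

Both directions compute the same values; they differ in whether the random accesses are reads of old data (pull) or writes of new data (push).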
In this paper we study the pull traversal of SpMV, except in Section VII-B where we compare push and pull traversals.

B. Environment
We use a 2-socket machine with 768 GB main memory. Each socket has an Intel® Xeon® Gold 6130 with 16 cores, 32KB L1 cache, 1MB L2 cache, and 22MB shared L3 cache. The machine uses CentOS 7 and does not use hyper-threading.
To have a correct evaluation of RAs, it is necessary to reduce the execution overheads of the processing framework because those overheads could swallow up the improvements provided by RAs. A study [50] shows that native hand-optimized implementations of graph analytics are faster than graph processing frameworks by a large margin. We used an optimized implementation of SpMV using the pthread, libnuma, and papi [51] libraries. It uses the interleaved NUMA memory policy and applies work-stealing [52] for parallel processing of graph partitions created by edge-balanced partitioning [49]. The master-worker model is used for managing parallel threads and the futex system call is used for thread synchronization. The compiler is gcc-9.2 with the -O3 flag. Each vertex data element is 8 bytes.

IV. RELABELING ALGORITHMS & LOCALITY TYPES
This section briefly explains the RAs studied in this paper and then introduces locality types. Table II shows the preprocessing time (in Seconds) and memory footprint (in GigaBytes) of the RAs.

A. SlashBurn
SlashBurn (SB) [10] considers the hubs as the main connector between vertices and exploits this feature to detect communities of vertices by removing hubs and finding the connected components (communities). This process continues in the next iteration for the giant connected component (GCC), the community with the largest number of edges. SB assigns consecutive IDs to hubs of the main graph and the giant communities starting from 0 (based on their degree) and vertices in a community also receive contiguous IDs.
We selected SB as it specifically targets real-world graphs; moreover, it is a representative of degree-ordering RAs. The original implementation of SlashBurn is in MATLAB® and we implemented a parallel version of SB in the C language that uses "basic hub-ordering" and k = 0.02|V|.

B. Rabbit-Order
Rabbit-Order (RO) [11] develops communities using neighbours of vertices. By starting from the vertices with the lowest degree, it searches for the neighbour with maximum "gain" that can be reached through merging. The gain function is defined as:
$$\Delta Q_{u,v} = 2\left(\frac{w_{uv}}{2|E|} - \frac{deg_u \, deg_v}{(2|E|)^2}\right)$$
where $w_{uv}$ is the weight of the edge between u and v and $deg_u$ is the degree of u. The vertex and its max-gain neighbour are temporarily merged for the purposes of reordering and the weight of the new vertex is calculated as $2w_{uv} + w_{uu} + w_{vv}$. After merging two vertices, the weights of their common edges are also added up. The initial weights of a vertex and an edge are 0 and 1, respectively.
The merging process continues while there is a neighbour u of v with $\Delta Q_{u,v} > 0$; otherwise, the vertex v is added to the top-level set, which contains the roots of communities. Finally, a parallel Depth First Search (DFS) is performed starting from members of the top-level set to assign new IDs.
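The selection of a merge partner can be sketched as below, assuming the standard modularity gain used by community detection methods. The exact normalization, the function names, and the example values are assumptions of this sketch and should be checked against [11]; the sketch also ignores the weight updates that follow a merge.

```python
from typing import Dict, Optional, Tuple


def modularity_gain(w_uv: float, deg_u: float, deg_v: float,
                    total_edges: float) -> float:
    """Assumed gain of merging u and v: 2 * (w_uv / 2m - deg_u * deg_v / (2m)^2)."""
    two_m = 2.0 * total_edges
    return 2.0 * (w_uv / two_m - (deg_u * deg_v) / (two_m * two_m))


def best_merge_candidate(deg_v: float,
                         neighbours: Dict[int, Tuple[float, float]],
                         total_edges: float) -> Optional[int]:
    """Return the neighbour with the largest positive gain, or None if there is none.

    neighbours maps a neighbour ID to (weight of the edge to v, degree of the neighbour).
    """
    best_id, best_gain = None, 0.0
    for u, (w_uv, deg_u) in neighbours.items():
        gain = modularity_gain(w_uv, deg_u, deg_v, total_edges)
        if gain > best_gain:
            best_id, best_gain = u, gain
    return best_id


# Hypothetical vertex of degree 2 in a graph with 10 edges: neighbour 7 has
# degree 3, neighbour 9 is a hub-like vertex of degree 50.
candidate = best_merge_candidate(2.0, {7: (1.0, 3.0), 9: (1.0, 50.0)}, total_edges=10)
```

With these values, merging with the low-degree neighbour has a positive gain while merging with the high-degree neighbour has a negative one, so a return value of `None` corresponds to adding the vertex to the top-level set.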
We selected RO as it is the fastest community detection RA. We used commit f67a79e of Rabbit-Order. RO produces different permutations in different executions and we observed that results vary by up to ±5%. One output of RO has been used for all experiments in this paper. RO did not complete its execution for the ClWb9 dataset because of an "out of memory" error.

C. GOrder
GOrder (GO) [2] prioritizes neighbours of vertices by defining a "score" function between two vertices: $S(u, v) = S_s(u, v) + S_n(u, v)$. The sibling score ($S_s(u, v)$) is the number of common in-neighbours of u and v, and the neighbourhood score ($S_n(u, v)$) is the number of edges between u and v. GO starts from the vertex with the maximum degree and uses a sliding window to find the vertex with maximum score (between neighbours of recently assigned IDs) to assign the next ID.
We selected GO because of its special algorithm that concentrates on increasing temporal reuse instead of identifying communities. We used commit 7ccdfe9 of GOrder with its default window size (5). It is a single-threaded implementation for graphs with $|E| < 2^{31}$.
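The score function can be sketched directly from its definition. The helper names and the example graph below are hypothetical, and the sketch covers only the pairwise score, not the sliding window or the priority queue of the GOrder implementation.

```python
from typing import List, Set, Tuple


def build_neighbour_sets(num_vertices: int,
                         edges: List[Tuple[int, int]]
                         ) -> Tuple[List[Set[int]], List[Set[int]]]:
    """Return (in_neighbours, out_neighbours) as lists of sets indexed by vertex ID."""
    in_nbrs: List[Set[int]] = [set() for _ in range(num_vertices)]
    out_nbrs: List[Set[int]] = [set() for _ in range(num_vertices)]
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    return in_nbrs, out_nbrs


def gorder_score(u: int, v: int,
                 in_nbrs: List[Set[int]],
                 out_nbrs: List[Set[int]]) -> int:
    """S(u, v) = sibling score + neighbourhood score."""
    sibling_score = len(in_nbrs[u] & in_nbrs[v])               # common in-neighbours
    neighbourhood_score = int(v in out_nbrs[u]) + int(u in out_nbrs[v])  # edges u->v, v->u
    return sibling_score + neighbourhood_score


# Hypothetical graph: vertices 1 and 2 share in-neighbours 0 and 3, and 1 -> 2.
edges = [(0, 1), (0, 2), (3, 1), (3, 2), (1, 2)]
in_nbrs, out_nbrs = build_neighbour_sets(4, edges)
score_1_2 = gorder_score(1, 2, in_nbrs, out_nbrs)
score_0_3 = gorder_score(0, 3, in_nbrs, out_nbrs)
```

For vertices with many in-neighbours the set intersection dominates the score, while for vertices with few in-neighbours the direct edges contribute a larger share.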

D. Locality Types
Considering random memory accesses in Line 4 of Algorithm 1, the following patterns of reuse of vertex data are identified. Each memory access is performed by cache queries and a TLB lookup in VIPT (Virtually Indexed, Physically Tagged) caches. We skip explanation of TLB index reuse for each locality type as it is similar to cache line reuse.
• Type I: The consecutive neighbours of vertex v are close so that accesses to the neighbours benefit from spatial reuse. This means that proximity of IDs of consecutive neighbours results in placing their data on the same cache line, which provides reuse in accessing data of neighbours.
• Type II: Subsequently processed vertices v and v + δ have common neighbours whose data is temporally reused. If vertex u is a neighbour of v and v + δ, then proximity of the IDs of v and v + δ allows the cache to reuse the data of u in processing v + δ after using it for processing v.
• Type III: Subsequently processed vertices v and v + δ have distinct neighbours, but the IDs of the neighbours are close together, which causes spatio-temporal reuse. If u is a neighbour of v, and u + θ is a neighbour of v + δ, and θ is small enough that data of u and u + θ are on the same cache line, then proximity of v and v + δ results in reuse of this cache line.
• Types IV and V: These types happen for reusing a cache line that has been loaded by another thread into a shared cache: a cache line contains data of vertices u and u + θ and is (re)used in semi-concurrent processing of distinct vertices by different threads. It is type IV if θ = 0 (similar to type II); otherwise, it is type V (similar to type III).
Types IV and V are not directly targeted by RAs as they mainly depend on (1) vertex partitioning (parallelization) and scheduling of the runtime environment, and (2) availability of the shared caches in the processor architecture. Types I, II and III are determined by the graph and are controlled by RAs.
GO aims to improve types II and III by selecting the vertex with the maximum score based on the current contents of the cache. RO targets type I and tries to improve clustering based on the neighbourhood of vertices, which also results in type III. SB aims to improve locality types I and III by identifying clusters, and types II and III by assigning consecutive IDs to hubs.

V. METRICS AND TOOLS
In order to measure spatial locality (type I), we introduce N2N AID degree distribution in Section V-A. Section V-B introduces cache miss rate degree distribution that measures temporal and spatio-temporal locality (types II and III).

A. Neighbour to Neighbour Average ID Distance
Community detection algorithms such as RO try to form clusters based on the neighbourhood of vertices. By assigning consecutive IDs to vertices in the same community, they aim to increase reuse of neighbours' data. To investigate how RAs succeed in bringing neighbours close to each other (spatial locality, type I), we calculate the distance between neighbours.
Using $N_{v,i}$ to denote the ID of the i-th neighbour of vertex v (with the neighbours list sorted in ascending order), the Neighbour to Neighbour Average ID Distance (AID) of v is defined as the average difference between the IDs of consecutive neighbours in this sorted list. When a RA assigns close IDs to neighbours of a vertex, the difference between IDs of consecutive neighbours is reduced and AID is reduced. In this way, lower AID values generally relate to better spatial locality. For a SpMV in the pull direction, AID considers only the in-neighbours of a vertex.
We study the effects of RO on spatial locality of different vertex classes, using AID degree distribution in Section VI-C (Figure 3). AID degree distribution is computed in O(|E|) time and O(max-degree) space complexities.
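A minimal sketch of the AID computation and its degree distribution is given below. It assumes that the sum of consecutive ID differences is divided by the number of gaps and that vertices with fewer than two neighbours have an AID of 0; both normalization choices, as well as the function names, are assumptions of this sketch. The sketch sorts each neighbour list for safety; with lists that are already sorted, a single pass over the edges suffices.

```python
from collections import defaultdict
from typing import Dict, List


def n2n_aid(sorted_neighbour_ids: List[int]) -> float:
    """Average difference between IDs of consecutive neighbours (0.0 if fewer than 2)."""
    if len(sorted_neighbour_ids) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(sorted_neighbour_ids, sorted_neighbour_ids[1:])]
    return sum(gaps) / len(gaps)


def aid_degree_distribution(csc_offsets: List[int],
                            csc_sources: List[int]) -> Dict[int, float]:
    """Map each in-degree to the mean AID of the vertices with that in-degree."""
    total: Dict[int, float] = defaultdict(float)
    count: Dict[int, int] = defaultdict(int)
    for v in range(len(csc_offsets) - 1):
        neighbours = sorted(csc_sources[csc_offsets[v]:csc_offsets[v + 1]])
        degree = len(neighbours)
        total[degree] += n2n_aid(neighbours)
        count[degree] += 1
    return {degree: total[degree] / count[degree] for degree in total}


# Hypothetical CSC arrays of a 4-vertex graph with edges
# 0->1, 0->2, 1->2, 2->0, 3->2.
distribution = aid_degree_distribution([0, 1, 2, 5, 5], [2, 0, 0, 1, 3])
```

Because only in-neighbour lists from CSC are used, this corresponds to the pull direction.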
It is useful to compare N2N AID to the "average gap profile" [23], which calculates the average of the differences between the IDs of the two endpoints of each edge to provide a summary of the spatial locality of the graph. In contrast, AID reflects that the neighbours of a vertex do not need to be close to the vertex itself: they only need to be close to each other to maximize spatial locality.
It is important to note that AID measures the clustering efficacy of a RA and is independent of the architecture. In this way, AID is not a deterministic spatial locality metric. As an example, assume a vertex has neighbours with IDs 1600, 3200, and 6400. If a RA changes the IDs of the neighbours to 400, 800, and 1600, AID is reduced but the spatial locality is not changed (as the neighbours are still on different cache lines). Consequently, a change in AID is not by itself sufficient to affect cache or TLB miss rates.
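The example can be checked with a short calculation. The sketch assumes 64-byte cache lines and a vertex data array that starts on a cache line boundary (both assumptions of this sketch), together with the 8-byte vertex data used in our experiments; the function name `cache_line_of` is hypothetical.

```python
def cache_line_of(vertex_id: int, bytes_per_vertex: int = 8, line_bytes: int = 64) -> int:
    """Index of the cache line that holds the data of a vertex (line-aligned array)."""
    return (vertex_id * bytes_per_vertex) // line_bytes


# Neighbour IDs before and after the hypothetical relabeling from the text.
lines_before = {cache_line_of(i) for i in [1600, 3200, 6400]}
lines_after = {cache_line_of(i) for i in [400, 800, 1600]}

# For contrast: neighbours whose IDs fall within one group of 8 vertices.
lines_clustered = {cache_line_of(i) for i in [8, 9, 10]}
```

Both `lines_before` and `lines_after` contain three distinct cache lines, so the lower AID does not reduce the number of lines touched, whereas the clustered IDs map to a single line.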

B. Cache Miss Rate Degree Distribution
To collect detailed information about RAs, we collect cache miss rate based on degree of vertices that shows how RAs affect locality types II and III of different vertex classes. We use simulation for this purpose; however, detailed simulation of processor and memory hierarchy (in simulators like Gem5 [55]) is time-consuming for large graphs.
Since graph analytics are memory intensive (in SpMV, there is just an add computation in Algorithm 1, Line 4), we ignore simulating execution of instructions except time-consuming memory instructions (load and store instructions) to make the simulation process efficient and fast.
We designed a trace-based simulator based on the cache simulator of SimpleScalar [56] and equipped it with an accurate implementation of the dueling BRRIP and SRRIP [57] cache replacement policies. We use this implementation for level 3 cache shared between the cores of each NUMA node and for the same configuration (number of sets and ways of associativity) as the real CPU. We instrumented Algorithm 1 at source code level to call the simulator for every load/store.
We performed the parallel simulation in two phases: (1) logging memory accesses during graph processing by each of the parallel threads, and (2) dividing execution duration between threads where for each interval a thread simulates all logged accesses by parallel threads in a round robin way. Figure 1 shows the degree distribution of cache miss rate for RAs. We will interpret these results in Section VI.
For the datasets used in this study, the average simulation time is 151 seconds. Compared to the real machine, the total number of cache misses in the simulation has an average error of 15%, and the average relative error (for comparing misses between two RAs) is 1.4%. This means that differences greater than 1.4% between miss rates of relabeled versions of a graph in Figure 1 are valid.
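The idea of collecting miss rates per degree can be sketched with a much simpler model than the simulator described above. The sketch below is single-threaded, uses LRU replacement instead of the dueling BRRIP/SRRIP policies, simulates only the random reads of vertex data, and attributes each access to the in-degree of the vertex being processed. All of these are simplifying assumptions, and the class and function names are hypothetical.

```python
from collections import OrderedDict, defaultdict
from typing import Dict, List


class LRUCache:
    """Set-associative cache with LRU replacement (a stand-in for the real policy)."""

    def __init__(self, num_sets: int, ways: int, line_bytes: int = 64) -> None:
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address: int) -> bool:
        """Return True on a hit and False on a miss; update the replacement state."""
        line = address // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)      # mark as most recently used
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict the least recently used line
        cache_set[line] = True
        return False


def miss_rate_by_degree(csc_offsets: List[int], csc_sources: List[int],
                        cache: LRUCache, bytes_per_vertex: int = 8) -> Dict[int, float]:
    """Map each in-degree to the miss rate of vertex data reads made for that degree."""
    misses: Dict[int, int] = defaultdict(int)
    accesses: Dict[int, int] = defaultdict(int)
    for v in range(len(csc_offsets) - 1):
        begin, end = csc_offsets[v], csc_offsets[v + 1]
        degree = end - begin
        for e in range(begin, end):
            hit = cache.access(csc_sources[e] * bytes_per_vertex)
            accesses[degree] += 1
            misses[degree] += 0 if hit else 1
    return {degree: misses[degree] / accesses[degree] for degree in accesses}


# Hypothetical 4-vertex graph; all vertex data fit in one 64-byte line.
rates = miss_rate_by_degree([0, 1, 2, 5, 5], [2, 0, 0, 1, 3], LRUCache(num_sets=1, ways=1))
```

Running the same function on the original and on the relabeled CSC arrays of a graph gives two distributions that can be compared degree by degree.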

VI. LOCALITY ANALYSIS OF RAS
A. Locality Analysis of SlashBurn
SB has been designed for power-law graphs: "We propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs, and so on, recursively" [10]. The main idea is to iteratively remove hubs of power-law graphs; however, the practicality of this method depends on whether power-law graphs are destroyed recursively.
To assess this theory we depict the degree distribution of GCC for different iterations of SB in Figure 2. Over different iterations of SB, the degree distribution of the GCC does not maintain the power-law property.
After a few iterations, the remaining network shows an almost-uniform degree distribution with low degrees. Further iterations of SB separate these LDV from their neighbours in what are perceived as different communities. As a result, neighbours are assigned widely distinct IDs, which reduces locality types I and III.
SB is partly similar to degree-ordering as a number of HDV receive initial consecutive IDs that increases temporal reuse (types II and III) in accessing data of out-hubs.
SB improves locality types IV and V (Section VI-F).

B. Locality Analysis of GOrder
GO tries to optimize locality by maximizing reuse of the current content of the cache (types II and III). It uses a sliding window and searches for a neighbour with the greatest score (Section IV-C). For a HDV in the sliding window the sibling score is dominant and the vertex with more common in-neighbours will have more chance to be selected. For a LDV in the sliding window the neighbourhood score is dominant.
GO considers common neighbours with only a limited number of already-placed vertices (a window size of the past 5 vertices). There are numerous LDV in power-law graphs and many of them appear equally "close" to the 5 last labeled vertices. As such, GOrder cannot properly distinguish which LDV to select. This is reflected in the cache miss rate (Figure 1) where GOrder decreases cache miss rate well on HDV but cannot perform well for LDV.
To further investigate GOrder's strategy towards HDV, we use cache simulation to count the number of misses that occur in accessing data of HDV. Table III shows that GO and SB have the lowest reloads of HDV. For Twitter MPI and Friendster, SB has lower reloads of vertices with degree > 2000; but, for vertices with degree > 20, GO has the lower reloads. As such, GOrder increases the number of reloads of HDV to provide space in cache for LDV (to reduce their reloads). The latter are exponentially more frequent in power-law graphs.
As we explained in Section VI-A, degree-ordering in SB keeps data of out-hubs in cache; but, the score function of GO selects vertices with more temporal reuse based on the current contents of the cache and prevents filling cache with HDV. In other words, GO allocates cache space to vertices with lower degree but with more temporal reuse in short durations of processing and in this way, GOrder reduces the presence of HDV in the cache to increase the total reuse. As a consequence, GO fills the cache with vertices of different degrees (but with more temporal reuse) rather than dedicating the cache capacity to vertices with the highest degree.

C. Locality Analysis of Rabbit-Order
RO builds communities bottom-up: it starts from low-degree vertices and merges neighbouring vertices while trying to maximize the gain function (Section IV-B). This results in a set of trees that reflect the communities and are used in the second phase to assign IDs by DFS traversal of each tree.
We use degree distribution of AID (Section V-A) to assess changes made by RO in spatial locality. As Figures 1 and 3 show, Rabbit-Order reduces AID of LDV and improves their spatial locality by using DFS in the second phase, which assigns spatially close IDs to neighbouring LDV in clusters. However, as the degree of vertices increases, DFS cannot assign consecutive IDs to the neighbours (because each neighbour has itself a number of neighbours). So, AID and cache miss rate of Rabbit-Order are increased for HDV.

D. Observation on Hubs
Figure 1 shows that all RAs incur higher miss rates for hub vertices. Processing an in-hub requires accessing data of several in-neighbours, and only a fraction of that data exists in cache. For the rest, memory accesses are required. While RAs change the order of edges of hubs, they cannot change the topology of the graph. So, locality of hubs is not improved by RAs as much as that of other vertices. Locality of hubs is important as hubs account for a large fraction of edges, i.e., a large fraction of traversal time. This observation demonstrates that hubs of real-world graphs suffer from a structural problem in relation to locality that cannot be solved by RAs.
Moreover, the data fetched from memory for processing an in-hub may be flushed before the end of processing that in-hub, as the data of subsequent neighbours of that in-hub (that again are missed in cache) should be read from memory and be placed in cache. In this way, reuse of cache contents is reduced as a side effect of processing in-hubs.

E. Real Execution Performance Metrics
Table IV shows the real execution of SpMV. The number of misses and idle time are averaged between threads.
Last level (L3) cache misses show the number of memory accesses that are not satisfied by caches and are sent to main memory. The number of L3 misses is the main locality metric. Table IV shows that SB usually destroys locality and increases the execution time. GO reduces L3 misses and execution time of social networks. RO improves locality of web graphs and reduces their execution time.
DTLB misses specify misses that occur in looking up translations of virtual addresses to physical addresses. While a DTLB miss results in (possibly multiple) memory accesses, DTLB misses are not usually a bottleneck as the total size of huge memory pages that are cached by TLB is much greater than the aggregate CPU cache capacity. DTLB misses show locality of RA at larger granularity, i.e., at longer reuse distances than L3 misses.
RO interleaves HDV between LDV during the ID assignment phase. By applying DFS on independent clusters whose data are placed in a few memory pages, Rabbit-Order minimizes inter-cluster edges, which reduces DTLB misses.
Idle time shows the average percentage of execution time that each thread is idle. Comparison between RO and the baseline for UU in Table IV shows that RO reduces L3 misses, but the execution time is not better than the baseline. Increased idle time is one of the reasons and shows that improving locality does not necessarily translate to improved performance.
RAs do not change the locality of consecutive vertices evenly, and these vertices form the partitions that are assigned to or stolen by threads. As Table IV shows, improving locality of a graph dataset by a RA may therefore increase the idle time.

F. How Much of Cache Capacity Is Effectively Used?
We introduce the term Effective Cache Size (ECS) as "the percentage of cache capacity dedicated to caching randomly accessed data". In SpMV (Algorithm 1), this measures the proportion of cache used to cache $D_i$. It is important as cache lines of topology data are sequentially accessed and have a limited reuse. So, there is no merit in keeping topology data in cache; but, randomly accessed vertex data are reused and dedicating more cache space to them is beneficial to performance.
We use functional (timing-less) simulation to estimate ECS. We periodically scan the cache contents during execution to identify cache lines containing old data of vertices. Table V shows the results: RAs do not utilize all capacity of the cache to satisfy random memory accesses.
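The periodic scan can be sketched as below, assuming that each scan yields the set of resident cache line indices and that the old vertex data occupy a known byte range. The function names, the 64-byte line size, and the example snapshots are assumptions of this sketch.

```python
from typing import Iterable, List, Set, Tuple


def effective_cache_size(resident_lines: Iterable[int],
                         vertex_data_range: Tuple[int, int],
                         capacity_lines: int,
                         line_bytes: int = 64) -> float:
    """Percentage of the cache capacity that holds lines of the vertex data array."""
    begin, end = vertex_data_range  # byte range [begin, end) of the old vertex data
    vertex_lines = sum(1 for line in resident_lines
                       if begin <= line * line_bytes < end)
    return 100.0 * vertex_lines / capacity_lines


def average_ecs(snapshots: List[Set[int]],
                vertex_data_range: Tuple[int, int],
                capacity_lines: int) -> float:
    """Mean ECS over the cache scans taken periodically during the simulation."""
    values = [effective_cache_size(s, vertex_data_range, capacity_lines)
              for s in snapshots]
    return sum(values) / len(values)


# Hypothetical 8-line cache; vertex data occupy bytes [0, 256), i.e. lines 0 to 3.
snapshots = [{0, 1, 2, 100, 101}, {0, 1, 2, 3, 100, 101, 102, 103}]
ecs = average_ecs(snapshots, vertex_data_range=(0, 256), capacity_lines=8)
```

Lines outside the vertex data range stand for topology data or empty ways; both reduce ECS in this formulation.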
Moreover, SB usually has the greatest ECS while it makes the most cache misses (Figure 1 and Table IV). In other words, by reducing locality, the effective cache size is increased.
To explain this, we need to review the arrangement of vertices in the SB algorithm. By separating LDV from their parents in the last iterations of SB, the locality types I and III of LDV are reduced. This means more memory requests are performed and cache lines with lower reuse are evicted faster (as a greater number of new cache lines are fetched from the main memory and should be placed somewhere in the cache). Therefore, cache lines of topology data are removed faster from cache and number of cache lines of vertex data is increased.
To have a better illustration, we compare this status to when all random memory accesses are clustered on a small number of vertex data because of better locality. So, only a fraction of cache capacity is dedicated to those frequently accessed vertices and ECS is reduced. Comparison of L3 misses in Table IV and ECS in Table V also shows that the RA with the best locality for a dataset usually has the lowest ECS.
This observation has an important repercussion for hardware design: the cache capacity is not fully used by the current state of the art. So, improving locality will mean caches are even more over-sized. Moreover, we need algorithms capable of exploiting the full capacity of the cache.
Increasing ECS in SB results in filling cache with a great number of vertex data and locality types IV and V are improved in processing numerous neighbours of hubs. So, as Figure 1 shows, the miss rate of hubs is reduced by SlashBurn.

A. Web Graphs vs. Social Networks
Table IV shows that GO is the RA that performs well for social networks, while RO performs well for web graphs. Section VI-B explains that GO improves locality of HDV and Section VI-C demonstrates how RO improves locality of LDV. So, HDV of social networks and LDV of web graphs are the main sources of improving locality by GO and RO, respectively.
To explain this, we compare the connection between HDV in social networks and web graphs by defining the Asymmetricity of a vertex as the fraction of its in-neighbours that are not out-neighbours. Figure 4 compares the degree distribution of asymmetricity of TwtrMpi (as a social network) to UK-Union (as a web graph). It shows that TwtrMpi has highly symmetric vertices with high in-degrees. In other words, in-hubs are almost symmetric in social networks (in-hubs are out-hubs), while web graphs do not have symmetric in-hubs.
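As a minimal sketch of the asymmetricity definition above, the following computes the metric per vertex and averages it per in-degree. The function names and the dictionary-based adjacency representation are assumptions made for this example, as is the choice to report 0.0 for vertices without in-neighbours.

```python
def asymmetricity(in_nbrs, out_nbrs):
    """Fraction of in-neighbours of a vertex that are not also out-neighbours."""
    in_set = set(in_nbrs)
    if not in_set:
        return 0.0  # assumption: vertices without in-neighbours count as symmetric
    return len(in_set - set(out_nbrs)) / len(in_set)


def asymmetricity_by_in_degree(in_adj, out_adj):
    """Mean asymmetricity of vertices grouped by in-degree.

    in_adj / out_adj map each vertex to its list of in- / out-neighbours.
    """
    sums, counts = {}, {}
    for v, in_nbrs in in_adj.items():
        deg = len(set(in_nbrs))
        sums[deg] = sums.get(deg, 0.0) + asymmetricity(in_nbrs, out_adj.get(v, []))
        counts[deg] = counts.get(deg, 0) + 1
    return {deg: sums[deg] / counts[deg] for deg in sums}
```

For a graph with edges 1→0, 2→0, and 0→1, vertex 0 has in-neighbours {1, 2} and out-neighbours {1}, so its asymmetricity is 0.5.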
For further investigation, we analyze the connection between vertices by defining degree classes: "1-10", "10-100", "100-1K", ... . Figure 5 represents the degree range decomposition as the correlation between the degrees of neighbouring vertices: all edges to vertices in a degree class are binned based on the degree class of their source vertex. E.g., vertices with in-degree between 10-100 in TwtrMpi receive 29% of their incoming edges from vertices with out-degree 100K-1M.
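The degree range decomposition can be sketched as below. The class boundaries are an assumption: a degree d is placed in class k when 10^k ≤ d < 10^(k+1), so class 0 stands for "1-10", class 1 for "10-100", and so on. The function names are hypothetical.

```python
from collections import defaultdict


def degree_class(d):
    """Decade class of a degree: 0 for 1-9, 1 for 10-99, 2 for 100-999, ..."""
    if d < 1:
        raise ValueError("degree classes are defined for degrees >= 1")
    return len(str(d)) - 1  # number of decimal digits minus one


def degree_range_decomposition(edges, num_vertices):
    """For each in-degree class of destinations, the percentage of incoming
    edges that originate from each out-degree class of sources."""
    out_deg = [0] * num_vertices
    in_deg = [0] * num_vertices
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in edges:
        counts[degree_class(in_deg[dst])][degree_class(out_deg[src])] += 1

    result = {}
    for dst_cls, row in counts.items():
        total = sum(row.values())
        result[dst_cls] = {src_cls: 100.0 * c / total for src_cls, c in row.items()}
    return result
```

Each row of the result sums to 100%, matching the way an entry is read in the text: the share of a destination class's incoming edges contributed by one source class.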
For vertices with degree greater than 1K in TwtrMpi, HDV form more than half of the neighbours, while in SK-Domain LDV dominate the neighbours of HDV. This shows that HDV are closely connected to each other in social networks. On the other hand, LDV are the main constituents of all degree classes of the web graphs.
Because of this tight connection of HDV in social networks, RO cannot form independent clusters (with a relatively small number of inter-cluster edges) and therefore RO cannot improve locality of the HDV. Table III also shows that RO has the most reloads, but GO manages hubs based on their temporal reuse (Section IV-C). In this way, GOrder optimizes reuse of a large number of fully connected HDV of social networks that cannot be kept simultaneously in the cache by giving priority to temporal reuse of vertices with lower degree.
On the other hand, web graphs do not have a tight connection between HDV and the important factor for locality is spatial locality between low-degree neighbours. As a result, Rabbit-Order efficiently groups LDV to reduce AID and improves their locality (Figures 1 and 3).

B. Push Locality vs. Pull Locality
Section II-F explained push and pull traversal directions. In this section we explain how different datasets benefit from a particular traversal direction.
Push and pull traversals differ in two aspects: (1) pull uses CSC while push uses CSR, and (2) pull reads the data of vertices while push writes it. So, the comparison of push and pull traversals should be performed in two steps: (1) investigating the impact of the CSC and CSR formats of the graph by considering the same operation (e.g., read) for both formats (instead of read in CSC and write in CSR), and (2) identifying how read and write instructions affect the CSC and CSR traversals. The second step depends on the analytic algorithm, so we concentrate on the first step to understand the impact of different real-world graphs on the locality of push and pull traversals. Table VI compares CSC and CSR traversals for the read operation, i.e., each vertex sums the data of its in-neighbours (in CSC traversal) or its out-neighbours (in CSR traversal). It shows that there is a fundamental difference: web graphs have faster CSR traversal, but CSC traversal is faster for social networks.
To explain the differences in CSR and CSC locality, we study the structure of power-law graphs. The effect of hubs on CSR and CSC traversals becomes clearer by noting two points discussed in Section VII-A: (1) real-world graphs may have both in-hubs and out-hubs or only one of them, and (2) in-hubs are not always out-hubs. Moreover, in a pull traversal using the CSC format, out-hubs have a constructive effect on locality as their data is frequently accessed and is reused in processing several vertices; in a push traversal using CSR, it is the in-hubs that improve locality.
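The read-only comparison described above can be sketched as follows: the same summation kernel is run over the CSR and the CSC representation of one graph, so only the format differs. The helper names `build_compressed` and `read_traversal` are introduced for this example and are not taken from the paper's implementation.

```python
def build_compressed(edges, n, by_source):
    """Build (offsets, neighbours) arrays.

    by_source=True gives CSR (out-neighbours per vertex);
    by_source=False gives CSC (in-neighbours per vertex).
    """
    offsets = [0] * (n + 1)
    for src, dst in edges:
        key = src if by_source else dst
        offsets[key + 1] += 1
    for i in range(n):
        offsets[i + 1] += offsets[i]  # prefix sum of per-vertex degrees

    neighbours = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies: next free slot per vertex
    for src, dst in edges:
        key, other = (src, dst) if by_source else (dst, src)
        neighbours[cursor[key]] = other
        cursor[key] += 1
    return offsets, neighbours


def read_traversal(offsets, neighbours, data):
    """Each vertex sums the data of its neighbours in the given format."""
    n = len(offsets) - 1
    result = [0.0] * n
    for v in range(n):
        acc = 0.0
        for e in range(offsets[v], offsets[v + 1]):
            acc += data[neighbours[e]]  # random read of vertex data
        result[v] = acc
    return result
```

Timing `read_traversal` on the CSR and CSC arrays of the same graph isolates the effect of the format, since both runs only read vertex data.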
In order to explain locality of push and pull traversals, we consider the number of edges that are processed by keeping H hubs with maximum degrees in the cache. This shows what fraction of total edges (as an indicator of total processing) is covered (completed) by these H hubs. Figure 6 illustrates the percentage of edges covered by hubs while increasing the number of hubs for a social network (Twitter MPI) and a web graph (SK-Domain). Figure 6 shows that pull traversal of Twitter MPI can process 44% of edges by keeping 100K out-hubs in the cache, but push traversal can process about 23%. For SK-Domain it is vice versa, and pull traversal can process only 4% of the edges, while push traversal can process 64% of edges. We found the same trend across all graphs of the same types.
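The coverage measure used above reduces to a simple computation on the degree sequence. The sketch below assumes that an edge counts as covered when the hub at its relevant endpoint is among the H highest-degree vertices: out-degrees are passed for the pull case and in-degrees for the push case. The function name is hypothetical.

```python
def hub_edge_coverage(degrees, num_hubs):
    """Percentage of edges incident to the num_hubs highest-degree vertices.

    Pass out-degrees to model pull traversal (out-hubs kept in cache) and
    in-degrees to model push traversal (in-hubs kept in cache).
    """
    total = sum(degrees)
    if total == 0:
        return 0.0
    top = sorted(degrees, reverse=True)[:num_hubs]
    return 100.0 * sum(top) / total
```

Evaluating this for increasing `num_hubs` on the out-degree and in-degree sequences of a graph gives the two curves compared in the text.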
This shows that web graphs benefit from push locality as they have more powerful in-hubs than out-hubs, while social networks benefit from pull locality because of their more powerful out-hubs.

VIII. OPTIMIZING LOCALITY AND RAS
A. Optimizing Locality and Memory Accesses
Section VI-D showed that RAs are incapable of improving locality of hub vertices. To counter this, iHTL [58] presents a SpMV traversal to optimize locality of in-hubs in real-world graphs. iHTL creates dense flipped blocks (sub-graphs) that contain edges to in-hubs and processes them in push direction, while processing the sparse block in the pull direction.
In contrast to RAs that are not able to effectively utilize cache (Section VI-F), iHTL specifies the number of in-hubs by considering the cache size. In this way, cache capacity utilization is optimized in processing flipped blocks.
Section II-E showed two general methods for improving locality of random accesses: (1) changing the order of vertices, and (2) rearrangement of edges. The former is used by RAs and the latter is deployed by iHTL by creating a number of sub-graphs to exploit locality in processing in-hub vertices.
Thrifty Label Propagation [59] presents a structure-aware Connected Components algorithm that reduces memory accesses in processing hubs of power-law graphs.

B. Improving RAs
1) SlashBurn: In Section VI-A we explained that the GCC of SB does not have a power-law degree distribution after some initial iterations and contains a network of LDV. So, the subsequent iterations destroy the neighbourhood of LDV. To counter this, we propose a variation on SB (called SlashBurn++) that continues the iterations while GCC-max-degree ≥ |V|. SlashBurn++ reduces preprocessing time, traversal time, and L3 misses (Table VII).
2) Rabbit-Order: The degree distribution of the cache miss rate (Figure 1) can be used to identify an efficacy degree range (EDR): the range of degrees for which the RA improves locality. We can skip relabeling vertices that are not in the EDR to reduce the preprocessing time and memory space: during relabeling, we pass to the RA only the edges of vertices whose degree is within the EDR. For the other vertices, we let the labels be determined in the same manner as for zero-degree vertices. Applying this technique to RO on two datasets reduced the preprocessing time without affecting the traversal time: for Frndstr the preprocessing time fell from 139 to 103 seconds, and for TwtrMpi from 66 to 12 seconds.
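The EDR filtering step can be sketched as a pre-pass over the edge list. This is only one possible reading of the technique: it assumes an edge is kept when the degree of its source vertex lies inside the EDR, and the function name and parameters (`edr_min`, `edr_max`) are hypothetical. How the remaining vertices receive their labels is left to the RA and is not modelled here.

```python
def filter_edges_by_edr(edges, degrees, edr_min, edr_max):
    """Keep only edges whose source vertex has a degree within [edr_min, edr_max].

    Assumption: the source vertex's degree decides membership; the text does
    not specify which endpoint or which degree (in/out) is used.
    """
    return [
        (src, dst)
        for src, dst in edges
        if edr_min <= degrees[src] <= edr_max
    ]
```

The filtered list is what would be handed to the RA, so its preprocessing cost scales with the number of edges inside the EDR rather than with the whole graph.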

C. Further Suggestions as Future Work
Section VI-B showed that GO improves locality of HDV, while RO improves locality of LDV (Section VI-C). Therefore, a new RA could merge the Rabbit-Order and GOrder techniques to improve locality of both LDV and HDV. Such an RA may start from LDV like RO to build initial clusters and then switch to a method like GO to relabel HDV.
We observed that GO cannot improve locality of LDV because of the fixed size of its sliding window. It can be improved by dynamically changing the size of the sliding window based on the contents of the window. Moreover, the vertex selection policy of GOrder can be improved, for example by selecting the vertex with the highest percentage of neighbours that can be processed by traversing prior vertices in the sliding window.
In Section VI-A, we saw ECS is affected by RAs. However, RAs are cache-oblivious algorithms [60], [61] and do not take the cache size into account. RAs can be improved by considering caching parameters of the execution machine(s):
• SB can specify the number of hubs and therefore the number of its iterations based on the cache size,
• GO can use cache size to identify its window size, and
• RO also can use cache size as an indicator of the maximum number of vertices in a community which prevents increasing size of communities indefinitely (Section VII-A).

A. Locality Optimizing Algorithms
Space-filling curves improve locality without relabeling the graph. These techniques were first investigated for dense linear algebra [62]. More recently they have been applied to graph processing [63], [64]. They are most easily applied in a coordinate list representation of the graph.
In [16], graph relabeling is used to provide better locality for neighbour vertices and therefore to provide better graph compression.

B. Evaluation of RAs
The impacts of RAs on different graph analytics have been studied in [20]-[23]; however, these studies do not reveal details of RAs and how they affect locality of graphs. This paper is the first to investigate the functionality of RAs based on different vertex classes.

X. CONCLUSION
This paper introduces a number of techniques to efficiently analyze graph reordering algorithms (RAs) and their effects on real-world graphs. We classified locality types to enrich the terminology required for the discussion and we presented an accurate graph-specific simulation technique that allowed us to investigate locality conditional on the degree of vertices. We presented N2N AID as a spatial locality metric.
Using these techniques and metrics we studied three state-of-the-art locality optimizing RAs: SlashBurn, GOrder, and Rabbit-Order to identify how they affect locality of different vertices. Moreover, we presented a structural analysis of real-world graphs that explains the contrasting behaviours of datasets in relation to RAs. We identified a tight network of high-degree vertices in social networks that suffers from poor temporal locality and we discussed the functionality of GOrder that enhances temporal locality of these datasets. Analysis of web graphs showed that their spatial locality is improved by clustering low-degree vertices in Rabbit-Order. Effective cache size was introduced as a metric of cache capacity utilization, and we saw that it is reduced as locality is improved by RAs.
We also studied differences in locality of push and pull traversals as consequences of the structure of datasets and showed that web graphs benefit from push locality but social networks benefit from pull locality. This reveals the necessity of considering the structure of datasets in selecting a suitable direction for processing and also in interpreting results.
Finally, we presented some immediate improvements to RAs based on our study and also expressed further suggestions that need more fundamental research.

ONLINE WEB PAGE
Further discussions relating to this paper are available online at https://blogs.qub.ac.uk/GraphProcessing/Locality-Analysis-of-Graph-Reordering-Algorithms/.