This paper proposes software-defined floating-point number formats for graph processing workloads, which can improve performance in irregular workloads by reducing cache misses. Efficient arithmetic on software-defined number formats is challenging, even when based on conversion to wider, hardware-supported formats. We derive efficient conversion schemes that are tuned to the IA64 and AVX512 instruction sets.
We demonstrate that: (i) reduced-precision number formats can be applied to graph processing without loss of accuracy; (ii) conversion of floating-point values is possible with minimal instructions; (iii) conversions are most efficient when utilizing vectorized instruction sets, specifically on IA64 processors.
Experiments on twelve real-world graph data sets demonstrate that our techniques result in speedups of up to 89% for PageRank and Accelerated PageRank, and of up to 35% for Single-Source Shortest Paths. The same techniques help to accelerate the integer-based maximal independent set problem by up to 262%.
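The idea of storing values in a narrow software-defined format and converting them to a wider, hardware-supported format for arithmetic can be illustrated with a minimal scalar sketch. The 16-bit layout used here (the upper 16 bits of an IEEE-754 binary32 value, which coincides with the bfloat16 layout) is an assumption chosen for illustration and is not necessarily one of the formats proposed in the paper. The function names `encode16` and `decode16` are hypothetical, and the vectorized conversion schemes the paper derives are not reproduced.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Interpret a 32-bit unsigned integer as an IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def encode16(x: float) -> int:
    """Narrow to the assumed 16-bit format by truncating the low mantissa bits."""
    return f32_bits(x) >> 16


def decode16(h: int) -> float:
    """Widen back to binary32: a single shift restores a valid bit pattern."""
    return bits_f32((h & 0xFFFF) << 16)


# Values are stored in 2 bytes each and widened only when arithmetic is needed.
values = [0.15, 1.0, 0.000123, 3.5]
packed = [encode16(v) for v in values]
restored = [decode16(h) for h in packed]

for v, r in zip(values, restored):
    print(f"{v!r} -> {r!r} (relative error {abs(v - r) / abs(v):.2e})")
```

With 7 stored mantissa bits, truncation keeps the relative error below 2^-7 for normal values. Whether that precision suffices depends on the algorithm, which is the question claim (i) addresses.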
Vectorization seeks to accelerate computation through data-level parallelism. Vectorization has been applied to graph processing, where the graph is traversed either in a push style or a pull style. As it is not well understood which style will perform better, there is a need for both vectorized push-style and pull-style traversals. This paper is the first to present a general solution to vectorizing push-style traversal. It moreover presents an enhanced vectorized pull-style traversal.
Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.
Experimental evaluation demonstrates average speedups of 2.72X over Ligra, 2.46X over GraphGrind, and 2.33X over GraphIt. Graptor outperforms Grazelle, which performs vectorized pull-style graph processing, by 4.05X.
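The difference between the push and pull traversal styles mentioned above can be shown with a small scalar sketch. It is not Graptor code and does not reflect CleanCut or VectorFast. The edge list, the function names `push` and `pull`, and the choice of summing source values into destinations are all assumptions made for illustration.

```python
def push(n, edges, x):
    """Push style: iterate over sources and scatter x[src] to each destination."""
    out_edges = [[] for _ in range(n)]
    for src, dst in edges:
        out_edges[src].append(dst)
    y = [0.0] * n
    for src in range(n):
        for dst in out_edges[src]:
            # Scattered write: different sources may update the same y[dst],
            # so parallel or vectorized execution must guard against races.
            y[dst] += x[src]
    return y


def pull(n, edges, x):
    """Pull style: iterate over destinations and gather x[src] from in-neighbors."""
    in_edges = [[] for _ in range(n)]
    for src, dst in edges:
        in_edges[dst].append(src)
    y = [0.0] * n
    for dst in range(n):
        acc = 0.0
        for src in in_edges[dst]:
            # Gathered read: each y[dst] has exactly one writer.
            acc += x[src]
        y[dst] = acc
    return y


# A small directed graph given as (source, destination) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
x = [1.0, 2.0, 3.0, 4.0]

print(push(4, edges, x))
print(pull(4, edges, x))
```

Both functions compute the same result. They differ in whether the irregular memory accesses are writes (push) or reads (pull), which is why the two styles call for different vectorization strategies.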