{"id":521,"date":"2021-04-14T14:33:09","date_gmt":"2021-04-14T13:33:09","guid":{"rendered":"https:\/\/blogs.qub.ac.uk\/dipsa\/scheduling-fine-grain-loops-in-graph-processing-workloads\/"},"modified":"2021-04-14T14:33:09","modified_gmt":"2021-04-14T13:33:09","slug":"scheduling-fine-grain-loops-in-graph-processing-workloads","status":"publish","type":"post","link":"https:\/\/blogs.qub.ac.uk\/dipsa\/scheduling-fine-grain-loops-in-graph-processing-workloads\/","title":{"rendered":"Scheduling Fine-Grain Loops in Graph Processing Workloads"},"content":{"rendered":"\n<p>Scheduling or distributing the computational workload over multiple threads is a critical and repeatedly performed activity in graph processing workloads. In a recent paper &#8220;<a href=\"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.6241\">Reducing the burden of parallel loop schedulers for many\u2010core processors<\/a>&#8221; published in Concurrency &amp; Computation: Practice &amp; Experience, we investigated the overhead introduced by scheduling. This overhead follows from two effects: (i) threads require to communicate and arrive at the same point in the program at the same time; (ii) inter-thread communication incurs significant cache misses and coherence messages sent between processors. We have likened the work distribution to barrier synchronisation and observed that state-of-the-art parallel schedulers such as the Intel OpenMP runtime and Intel Cilkplus incur the cost of a full-barrier synchronisation at the start of a parallel loop and at the end of the loop. The below figure illustrates the synchronisation pattern:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/blogs.qub.ac.uk\/DIPSA\/wp-content\/uploads\/sites\/14\/2023\/02\/fine-grain-loop-1.png\" alt=\"\" class=\"wp-image-536\" \/><\/figure>\n\n\n\n<p>A barrier synchronisation is a synchronisation mechanism that waits for all threads to arrive at the barrier, then signals each thread they may continue execution. If we look in more detail at a barrier, it consists of two phases: a join phase and a release phase:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/blogs.qub.ac.uk\/DIPSA\/wp-content\/uploads\/sites\/14\/2023\/02\/fine-grain-loop-2.png\" alt=\"\" class=\"wp-image-537\" \/><\/figure>\n\n\n\n<p>However, this introduces redundant synchronisation. It suffices to place only a half-barrier synchronisation at the start of the loop, and the other half at the end of the loop. Schematically, this looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/blogs.qub.ac.uk\/DIPSA\/wp-content\/uploads\/sites\/14\/2023\/02\/fine-grain-loop-3.png\" alt=\"\" class=\"wp-image-538\" \/><\/figure>\n\n\n\n<p>Based on this observation, we designed an optimised scheduling technique that works specifically well for fine-grain loops, which are typically counted loops with very short loop bodies.<\/p>\n\n\n\n<p>Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The below figure shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality kernel (BC). This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and30 MB L3 cache per socket. 
However, a full barrier at both ends of the loop performs redundant synchronisation. It suffices to place only a half-barrier at the start of the loop, and the other half at the end of the loop. Schematically, this looks as follows:

![Optimised synchronisation pattern: half-barriers at the start and at the end of the loop](https://blogs.qub.ac.uk/DIPSA/wp-content/uploads/sites/14/2023/02/fine-grain-loop-3.png)

Based on this observation, we designed an optimised scheduling technique that works particularly well for fine-grain loops, which are typically counted loops with very short loop bodies. (A sketch of the split-barrier idea appears at the end of this post.)

Using our optimised scheduler, fine-grain loops in graph processing applications can be sped up by 21.6% to 29.6%. The figure below shows a histogram of the performance obtained for the fine-grain loops in the betweenness centrality (BC) kernel. This evaluation was performed on a four-socket 2.6 GHz Intel Xeon E7-4860 v2 machine with 12 physical cores per socket (plus hyperthreading) and 30 MB of L3 cache per socket. The baseline uses the Intel Cilkplus scheduler, while "hybrid" shows the performance of a hybrid version of the Cilkplus scheduler that can execute a mixture of coarse-grain loops (scheduled using the normal Cilkplus policy) and fine-grain loops (scheduled using our optimised scheduler).

![Histogram of speedups for the fine-grain loops in the betweenness centrality kernel, comparing the baseline Cilkplus scheduler against the hybrid scheduler](https://blogs.qub.ac.uk/DIPSA/wp-content/uploads/sites/14/2023/02/fine-grain-loop-4-1.png)

As graph processing applications contain a mix of fine-grain and coarse-grain loops, the overall speedup for whole applications is below 5%.

More details can be found in the paper, published under Open Access: [https://onlinelibrary.wiley.com/doi/10.1002/cpe.6241](https://onlinelibrary.wiley.com/doi/10.1002/cpe.6241)
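As promised above, here is a minimal sketch of the split-barrier idea applied to a fine-grain counted loop. Everything in it (`FineGrainPool`, `parallel_for`, the static chunking policy) is invented for illustration and is not the API of Cilkplus or of the paper's scheduler; it only mirrors the synchronisation structure: the master publishes work with a release half-barrier at the start of the loop and waits in a join half-barrier at the end, so no loop pays for two full join-plus-release barriers.

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical fine-grain loop scheduler (illustrative only).
// Loop start = release half-barrier: publish the work, wake workers.
// Loop end   = join half-barrier: wait for workers to finish.
class FineGrainPool {
public:
    explicit FineGrainPool(int n) : num_workers(n) {
        for (int t = 0; t < n; ++t)
            workers.emplace_back([this, t] { worker_loop(t); });
    }

    ~FineGrainPool() {
        stop.store(true);
        epoch.fetch_add(1, std::memory_order_release); // wake workers to exit
        for (auto& w : workers) w.join();
    }

    // Run body(i) for every i in [0, iterations) across all workers.
    void parallel_for(long iterations, std::function<void(long)> body) {
        job = std::move(body);
        total = iterations;
        done.store(0, std::memory_order_relaxed);
        // Release half-barrier: publishing a new epoch is the only
        // synchronisation at loop start; nobody waits here.
        epoch.fetch_add(1, std::memory_order_release);
        // Join half-barrier: the master waits only at loop end.
        while (done.load(std::memory_order_acquire) != num_workers)
            std::this_thread::yield();
    }

private:
    void worker_loop(int id) {
        long seen = 0;
        for (;;) {
            // Wait for the next published epoch (the release half).
            while (epoch.load(std::memory_order_acquire) == seen)
                std::this_thread::yield();
            seen = epoch.load(std::memory_order_relaxed);
            if (stop.load()) return;
            // Statically chunked iteration space: cheap to compute,
            // which suits counted loops with very short bodies.
            long chunk = (total + num_workers - 1) / num_workers;
            long lo = id * chunk;
            long hi = std::min(total, lo + chunk);
            for (long i = lo; i < hi; ++i) job(i);
            // Signal completion (the join half).
            done.fetch_add(1, std::memory_order_release);
        }
    }

    const int num_workers;
    std::function<void(long)> job;
    long total = 0;
    std::atomic<long> epoch{0};
    std::atomic<int> done{0};
    std::atomic<bool> stop{false};
    std::vector<std::thread> workers;
};
```

A fine-grain counted loop, such as initialising per-vertex values, could then be written as `pool.parallel_for(num_vertices, [&](long v) { rank[v] = 0.0; });`. The point of the sketch is the synchronisation count: one release at loop start and one join at loop end, rather than a full barrier (join plus release) at each end, which is where the redundant half-barriers of conventional schedulers are saved.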