Dynamic Load Balancing and Elasticity in a Distributed Heterogeneous Task-based Dataflow Runtime

dc.contributor.author: John, Joseph
dc.date.accessioned: 2025-05-14T01:59:44Z
dc.date.available: 2025-05-14T01:59:44Z
dc.date.issued: 2025
dc.description.abstract: For over three decades, scientific computing software has relied heavily on process-centric distributed programming models. In this paradigm, individual processes divide the workload between nodes, and threads further divide the tasks within each node. This model proved efficient while hardware was homogeneous and applications were regular and balanced. However, with the advent of hardware heterogeneity and increasing application irregularity, the process-centric model exhibits significant limitations. As computing nodes become more capable of managing asynchronous events, and thus better able to scale applications, it becomes imperative for programming models to identify and expose asynchronicity within applications. Process-centric programming models, however, often struggle to fully exploit this asynchronicity. Another critical drawback of the process-centric model is the number of low-level decisions the programmer must make, such as data distribution, GPU offloading, synchronization, and communication. Compounding these challenges is the absence of automatic load balancing in process-centric models. To meet current requirements, we need a programming model capable of managing multi-node applications while effectively handling the heterogeneity within individual computing nodes. In this regard, the task-based dataflow programming model has emerged as a viable alternative to the process-centric model for extreme-scale applications. In this model, the application is conceptualized as a collection of tasks with data flowing between them. Decomposing the problem at the task level exposes more asynchronicity within an application. An additional benefit of the task-based model is the shift of low-level responsibilities from the programmer to the runtime. Despite these advantages, dynamic, automatic load balancing remains a challenge in task-based dataflow programming models. This thesis employs the task-based dataflow runtime PaRSEC to investigate the challenges associated with dynamic, automatic load balancing. The research examines distributed load balancing and provides an in-depth analysis of the design and policy choices that affect application performance when migrating tasks from one compute node to another. It demonstrates that distributed work stealing is an effective load-balancing mechanism even without collecting any load information, and it introduces new work-stealing strategies that consider future tasks when deciding which tasks to steal. Further, this research addresses the challenges of, and solutions for, load balancing among the GPUs within a compute node. Multi-GPU computing nodes have become integral to today's high-performance computing systems, yet despite notable strides in programmability, efficiently harnessing all the GPUs in a node remains a substantial challenge. This thesis demonstrates, for the first time, improved application performance in a task-based dataflow runtime through work sharing among the GPUs within a computing node, with a specific focus on improving GPU utilization. Additionally, this research introduces the first instance of migrating tasks whose data is already resident in GPU memory to another GPU, showing that such migration incurs no disadvantage.
Further, this thesis extends the distributed work-stealing mechanism to migrate tasks intended for GPUs across compute nodes, and it demonstrates, for the first time, the utility of this approach as a mechanism for implementing elasticity within the task-based dataflow runtime. Through these investigations, this thesis contributes insights into overcoming the load-balancing challenges of complex high-performance computing environments.
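To make the central mechanism concrete, the following is a minimal, illustrative sketch of work stealing between per-worker queues: each owner pops from one end of its own queue, while an idle worker steals from the other end of a victim's queue without consulting any global load information. This is a generic C++ illustration under assumed names (Task, LocalQueue, worker); it is not PaRSEC's API and does not reflect the thesis's actual implementation.

#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

// A task is just a callable in this sketch.
struct Task { std::function<void()> run; };

// Per-worker task queue: the owner pops from the back, thieves steal from the front.
class LocalQueue {
    std::deque<Task> tasks;
    std::mutex m;
public:
    void push(Task t) {
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> pop() {            // owner side
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
    std::optional<Task> steal() {          // thief side
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};

// Each worker drains its own queue first, then tries to steal from the others.
void worker(int id, std::vector<LocalQueue>& queues) {
    for (;;) {
        if (auto t = queues[id].pop()) { t->run(); continue; }
        bool stole = false;
        for (std::size_t v = 0; v < queues.size(); ++v) {
            if (static_cast<int>(v) == id) continue;
            if (auto t = queues[v].steal()) { t->run(); stole = true; break; }
        }
        // All queues appear empty: give up (safe here because tasks spawn no new tasks).
        if (!stole) return;
    }
}

int main() {
    std::vector<LocalQueue> queues(4);
    // Deliberately imbalanced start: all tasks land on queue 0.
    for (int i = 0; i < 32; ++i)
        queues[0].push(Task{[i] { std::printf("task %d\n", i); }});
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id)
        workers.emplace_back(worker, id, std::ref(queues));
    for (auto& w : workers) w.join();
    return 0;
}

In a distributed or multi-GPU setting the same pattern applies, except that stealing a task also implies moving (or already holding) its input data on the thief's node or device; those data-movement costs are where the design and policy choices studied in the thesis come into play.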
dc.identifier.uri: https://hdl.handle.net/1885/733750355
dc.language.iso: en_AU
dc.title: Dynamic Load Balancing and Elasticity in a Distributed Heterogeneous Task-based Dataflow Runtime
dc.type: Thesis (PhD)
local.contributor.affiliation: College of Systems and Society, The Australian National University
local.contributor.supervisor: Potanin, Alex
local.identifier.doi: 10.25911/KWXM-ZX15
local.identifier.proquest: Yes
local.identifier.researcherID:
local.mintdoi: mint
local.thesisANUonly.author: 3319239a-5e05-4ac1-843b-aac0040b365d
local.thesisANUonly.key: 1c7bc80c-27e1-12c4-e95b-d93ef911ff11
local.thesisANUonly.title: 000000021298_TC_1

Downloads

Original bundle

Name: u6779084_Thesis_corrected.pdf
Size: 11.95 MB
Format: Adobe Portable Document Format
Description: Thesis Material