Dynamic Load Balancing and Elasticity in a Distributed Heterogeneous Task-based Dataflow Runtime

Authors

John, Joseph

Abstract

For over three decades, scientific computing software has relied heavily on process-centric distributed programming models. In this paradigm, individual processes divide the workload between nodes, and threads further divide the tasks within each node. This model proved efficient when hardware was homogeneous and applications were regular and balanced. However, with the advent of hardware heterogeneity and increased application irregularity, the process-centric model exhibits significant limitations. As computing nodes become more efficient at managing asynchronous events, which improves their capacity to scale applications, it becomes imperative for programming models to identify and expose asynchronicity within applications. Process-centric programming models, however, often struggle to fully exploit this asynchronicity. Another critical drawback of the process-centric programming model is the number of low-level decisions that a programmer must make, such as data distribution, GPU offloading, synchronization, and communication. Compounding these challenges is the absence of automatic load balancing in process-centric models. To meet current requirements, we need a programming model capable of managing multi-node applications while effectively handling the heterogeneity among individual computing nodes. In this regard, the task-based dataflow programming model has emerged as a viable alternative to the process-centric programming model for extreme-scale applications. In this model, the application is conceptualized as a collection of tasks with data flowing between them. Decomposing the problem at the task level helps expose more asynchronicity within an application. An additional benefit of the task-based programming model is the shift of low-level responsibilities from the programmer to the runtime. Despite these advantages, dynamic automatic load balancing remains a challenge in task-based dataflow programming models.
This thesis employs the task-based dataflow runtime PaRSEC to investigate the challenges associated with dynamic and automatic load balancing. It examines distributed load balancing and provides an in-depth analysis of the design and policy choices that impact application performance when migrating tasks from one compute node to another. The research demonstrates that distributed work stealing, even without collecting any load information, is an effective load-balancing mechanism, and it introduces new work-stealing strategies that consider future tasks when deciding whether to steal. Further, this research addresses the unique challenges and solutions related to load balancing among different GPUs within a compute node. Multi-GPU computing nodes have become integral to today's high-performance computing systems. Despite notable strides in programmability, efficiently harnessing the power of all GPUs in a computing node remains a substantial challenge. This thesis demonstrates, for the first time, the enhancement of application performance in a task-based dataflow runtime by implementing work sharing among GPUs within a computing node, with a specific focus on optimizing GPU utilization. Additionally, this research introduces the first instance of migrating tasks whose data is already resident in GPU memory to another GPU, showing that such task migration incurs no disadvantages. Further, this thesis explores the mechanism of distributed work stealing to migrate tasks intended for GPUs across compute nodes. This research demonstrates the utility of this work-stealing approach, for the first time, as a mechanism to implement elasticity within the task-based dataflow runtime. Through these investigations, this thesis aims to contribute valuable insights into overcoming challenges associated with load balancing in complex high-performance computing environments.
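
The idea of work stealing without collecting load information can be illustrated with a toy simulation. The sketch below is not PaRSEC code and is not taken from the thesis; the function name, parameters, and the one-task-per-step execution model are all assumptions made for illustration. An idle node picks a victim at random, and the victim decides from its own queue alone whether it can give a task away.

```python
import random
from collections import deque


def simulate_work_stealing(initial_loads, steal=True, seed=0):
    """Toy model: each node runs one task per time step; idle nodes may steal.

    Returns (steps, executed), where executed[i] counts tasks run by node i.
    All names here are hypothetical; this is not the PaRSEC API.
    """
    rng = random.Random(seed)
    queues = [deque(range(n)) for n in initial_loads]
    executed = [0] * len(queues)
    steps = 0
    while any(queues):
        steps += 1
        if steal and len(queues) > 1:
            for thief, queue in enumerate(queues):
                if queue:
                    continue  # only idle nodes send steal requests
                # The thief picks a victim at random: no load information is gathered.
                victim = rng.choice([i for i in range(len(queues)) if i != thief])
                # The victim decides locally, keeping at least one task for itself.
                if len(queues[victim]) > 1:
                    queue.append(queues[victim].pop())
        # Every node that holds a task executes exactly one task this step.
        for node, queue in enumerate(queues):
            if queue:
                queue.popleft()
                executed[node] += 1
    return steps, executed
```

With every task starting on one node, for example `simulate_work_stealing([12, 0, 0, 0])`, the run never takes more steps than the baseline with `steal=False`, which needs as many steps as the longest initial queue. A real runtime would also have to account for communication cost and data movement, which this sketch ignores.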
