Road to exascale: Improving scheduling performances and reducing energy consumption with the help of end-users

David Glesser 1, 2, 3, 4
2 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
4 DATAMOVE - Data Aware Large Scale Computing
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : The field of High Performance Computing (HPC) is characterized by the continuous evolution of computing architectures, the proliferation of computing resources, and the increasing complexity of the problems users wish to solve. One of the most important pieces of software in the HPC stack is the Resource and Job Management System (RJMS), which stands between the user workloads and the platform, between the applications and the resources. This specialized software provides functions for building, submitting, scheduling, and monitoring jobs in a dynamic and complex computing environment. Reaching exaflop-scale HPC systems introduces new constraints and objectives. This thesis develops and tests the idea that the users of such systems can help reach the exascale. Specifically, we introduce new techniques that exploit user behavior to reduce energy consumption and improve overall cluster performance. To test the proposed techniques, we needed new tools and methodologies that scale up to large HPC clusters. We therefore designed tools that assess new RJMS scheduling algorithms for such large systems. These tools can run on small clusters by emulating or simulating bigger platforms. After evaluating different techniques for measuring the energy consumption of HPC clusters, we propose a new heuristic, based on the popular EASY backfilling algorithm, to control the power consumption of such large systems. We also demonstrate, using the same idea, how to control the energy consumption over a time period. The proposed mechanism is able to limit energy consumption while maintaining satisfactory performance. If energy is a limited resource, it has to be shared fairly. We also present a mechanism that shares energy consumption among users. We argue that sharing energy fairly among users should motivate them to reduce the energy consumption of their applications.
Finally, we analyze past and present user behavior using learning algorithms in order to improve the performance of parallel platforms. This approach not only outperforms state-of-the-art methods, it also offers promising insight into how such methods can improve other aspects of RJMS.

https://hal.inria.fr/tel-01425620
Contributor : Grégory Mounié
Submitted on : Tuesday, January 3, 2017 - 4:48:08 PM
Last modification on : Monday, February 25, 2019 - 4:34:17 PM
Long-term archiving on : Tuesday, April 4, 2017 - 2:42:36 PM

Identifiers

  • HAL Id : tel-01425620, version 1

Citation

David Glesser. Road to exascale: Improving scheduling performances and reducing energy consumption with the help of end-users. Distributed, Parallel, and Cluster Computing [cs.DC]. Univ. Grenoble Alpes, 2016. English. ⟨tel-01425620⟩
