
Learning to control large-scale parallel platforms

Abstract : Providing the computational infrastructure needed to solve complex problems arising in modern society is a strategic challenge. Organisations usually address this problem by building extreme-scale parallel and distributed platforms. High Performance Computing (HPC) vendors race for more computing power and storage capacity, leading to sophisticated, specialized Petascale platforms, soon to be Exascale platforms. These systems are centrally managed using dedicated software solutions called Resource and Job Management Systems (RJMS). A crucial problem addressed by this software layer is the job scheduling problem, where the RJMS chooses when and on which resources computational tasks will be executed. This manuscript provides ways to address this scheduling problem. No two platforms are identical. Indeed, the infrastructure, user behavior and organization's goals all change from one system to another. We therefore argue that scheduling policies should adapt to the system's behavior. In this manuscript, we provide multiple ways to achieve this adaptivity. Through an experimental approach, we study various tradeoffs between the complexity of the approach, the potential gain, and the risks taken.

Cited literature [153 references]
Contributor : ABES STAR
Submitted on : Friday, February 1, 2019 - 11:12:30 AM
Last modification on : Wednesday, July 6, 2022 - 4:18:00 AM
Long-term archiving on : Thursday, May 2, 2019 - 2:49:48 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01965150, version 2


Valentin Reis. Learning to control large-scale parallel platforms. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM045⟩. ⟨tel-01965150v2⟩


