Learning to control large-scale parallel platforms

Valentin Reis 1, 2
2 DATAMOVE - Data Aware Large Scale Computing
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : Providing the computational infrastructure needed to solve complex problems arising in modern society is a strategic challenge. Organisations usually address this problem by building extreme-scale parallel and distributed platforms. High Performance Computing (HPC) vendors race for more computing power and storage capacity, leading to sophisticated, specialized Petascale platforms, soon to be Exascale platforms. These systems are centrally managed using dedicated software solutions called Resource and Job Management Systems (RJMS). A crucial problem addressed by this software layer is the job scheduling problem, where the RJMS chooses when and on which resources computational tasks will be executed. This manuscript provides ways to address this scheduling problem. No two platforms are identical. Indeed, the infrastructure, user behavior and organization's goals all change from one system to the other. We therefore argue that scheduling policies should be adaptive to the system's behavior. In this manuscript, we provide multiple ways to achieve this adaptivity. Through an experimental approach, we study various tradeoffs between the complexity of the approach, the potential gain, and the risks taken.

Cited literature [153 references]

Contributor : Abes Star
Submitted on : Friday, February 1, 2019 - 11:12:30 AM
Last modification on : Friday, October 25, 2019 - 1:25:19 AM
Long-term archiving on: Thursday, May 2, 2019 - 2:49:48 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01965150, version 2


Valentin Reis. Learning to control large-scale parallel platforms. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM045⟩. ⟨tel-01965150v2⟩


