Learning to control large-scale parallel platforms

Valentin Reis 1, 2
2 DATAMOVE - Data Aware Large Scale Computing
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : Providing the computational infrastructure needed to solve complex problems arising in modern society is a strategic challenge. Organisations usually address this problem by building extreme-scale parallel and distributed platforms. High Performance Computing (HPC) vendors race for more computing power and storage capacity, leading to sophisticated, specific Petascale platforms, soon to be Exascale platforms. These systems are centrally managed using dedicated software solutions called Resource and Job Management Systems (RJMS). A crucial problem addressed by this software layer is the job scheduling problem, where the RJMS chooses when and on which resources computational tasks will be executed. This manuscript provides ways to address this scheduling problem. No two platforms are identical. Indeed, the infrastructure, user behavior and organization's goals all change from one system to the other. We therefore argue that scheduling policies should be adaptive to the system's behavior. In this manuscript, we provide multiple ways to achieve this adaptivity. Through an experimental approach, we study various tradeoffs between the complexity of the approach, the potential gain, and the risks taken.
Cited literature [93 references]

https://hal.inria.fr/tel-01965150
Contributor : Abes Star
Submitted on : Friday, February 1, 2019 - 11:12:30 AM
Last modification on : Wednesday, July 10, 2019 - 1:29:40 AM
Long-term archiving on : Thursday, May 2, 2019 - 2:49:48 PM

File

REIS_2018_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01965150, version 2

Citation

Valentin Reis. Learning to control large-scale parallel platforms. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM045⟩. ⟨tel-01965150v2⟩

Metrics

  • Record views : 103
  • Files downloads : 587