Conference papers

Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms

Abstract : This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios, and hence need to be re-executed each time they fail until successful completion. This work generalizes the classical framework where jobs are known offline and do not fail. We introduce a list-based algorithm, and prove new approximation ratios for three prominent speedup models (roofline, communication, and Amdahl). We also introduce a batch-based algorithm, where each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. We conduct an extensive set of simulations to evaluate and compare different variants of the two algorithms. The results show that they consistently outperform baseline heuristics. In particular, the list algorithm performs better for the roofline and communication models, while the batch algorithm performs better for Amdahl's model. Overall, our best algorithm is within a factor of 1.47 of a lower bound on average over the whole set of experiments, and within a factor of 1.8 in the worst case.
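The abstract names the three speedup models without giving their formulas. As a rough illustration, the sketch below uses the forms in which these models are commonly written in the scheduling literature; the paper's exact definitions may differ. The parameter names (`w` for total work, `p_max` for the maximum useful parallelism, `c` for a per-processor communication cost, `gamma` for the sequential fraction) are assumptions made here, and so is the re-execution helper, which supposes that every failed attempt costs a full execution on the same allocation.

```python
def roofline_time(w: float, p: int, p_max: int) -> float:
    """Roofline (assumed form): linear speedup up to p_max processors, flat beyond."""
    return w / min(p, p_max)


def communication_time(w: float, p: int, c: float) -> float:
    """Communication (assumed form): perfectly parallel work plus an
    overhead that grows with the number of processors used."""
    return w / p + c * (p - 1)


def amdahl_time(w: float, p: int, gamma: float) -> float:
    """Amdahl (assumed form): a fraction gamma of the work is sequential,
    the remaining (1 - gamma) is perfectly parallel."""
    return w * (gamma + (1 - gamma) / p)


def time_with_failures(attempt_time: float, failures: int) -> float:
    """Total time of a job that fails `failures` times before succeeding,
    assuming each attempt (failed or not) takes `attempt_time`."""
    return (failures + 1) * attempt_time


if __name__ == "__main__":
    # Hypothetical job: 100 units of work on 4 processors.
    print(roofline_time(100, 4, p_max=8))
    print(communication_time(100, 4, c=1))
    print(amdahl_time(100, 4, gamma=0.25))
    print(time_with_failures(roofline_time(100, 4, p_max=8), failures=2))
```

Under these assumed forms, adding processors always helps up to `p_max` in the roofline model, eventually hurts in the communication model, and yields diminishing returns in Amdahl's model. This is one way to see why the choice of allocation for a moldable job depends on the speedup model.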

https://hal.inria.fr/hal-03028773
Contributor : Equipe Roma
Submitted on : Monday, November 30, 2020 - 9:18:35 AM
Last modification on : Thursday, December 3, 2020 - 1:43:48 PM

File

moldable_cluster_hal.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03028773, version 1

Citation

Anne Benoit, Valentin Le Fèvre, Lucas Perotin, Padma Raghavan, Yves Robert, et al. Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms. CLUSTER 2020 - IEEE International Conference on Cluster Computing, Sep 2020, Kobe, Japan. pp.1-29. ⟨hal-03028773⟩
