Conference paper, 2020

Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms

Anne Benoit, Valentin Le Fèvre, Lucas Perotin, Padma Raghavan, Yves Robert, et al.

Abstract

This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. A moldable job lets the scheduler choose its processor allocation before execution, and its execution time then depends on that allocation according to a speedup model. The objective is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios and hence must be re-executed each time they fail, until successful completion. This work generalizes the classical framework in which jobs are known offline and do not fail. We introduce a list-based algorithm and prove new approximation ratios for three prominent speedup models (roofline, communication, and Amdahl). We also introduce a batch-based algorithm, in which each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. We conduct an extensive set of simulations to evaluate and compare different variants of the two algorithms. The results show that they consistently outperform baseline heuristics. In particular, the list algorithm performs better for the roofline and communication models, while the batch algorithm performs better for Amdahl's model. Overall, our best algorithm is within a factor of 1.47 of a lower bound on average over the whole set of experiments, and within a factor of 1.8 in the worst case.
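As a rough illustration of the three named speedup models and of re-execution after failures, the sketch below uses the textbook forms of the roofline, communication, and Amdahl models. The abstract does not state these formulas, so the function names, the parameters (`w` for total work, `p_max`, `c`, `gamma`), and the assumption that every attempt of a job takes the same time are illustrative assumptions, not the paper's definitions.

```python
def roofline_time(w, p, p_max):
    """Assumed roofline model: linear speedup up to p_max processors, none beyond."""
    return w / min(p, p_max)


def communication_time(w, p, c):
    """Assumed communication model: perfect parallelism plus an overhead c per extra processor."""
    return w / p + (p - 1) * c


def amdahl_time(w, p, gamma):
    """Assumed Amdahl model: a fraction gamma of the work is inherently sequential."""
    return w * (gamma + (1 - gamma) / p)


def time_with_failures(attempt_time, failures):
    """Total time of a job that fails `failures` times and then succeeds.

    Assumes each failed attempt runs to completion before the failure is
    detected and that the allocation is unchanged between attempts.
    """
    return (failures + 1) * attempt_time


if __name__ == "__main__":
    w = 100.0  # hypothetical total work of one job
    for p in (1, 2, 4, 8):
        print(p,
              roofline_time(w, p, p_max=4),
              communication_time(w, p, c=1.0),
              amdahl_time(w, p, gamma=0.2))
```

Under these assumed forms, adding processors beyond `p_max` does not help in the roofline model, the communication model eventually gets slower as `p` grows, and the Amdahl model never drops below `gamma * w`. These differences are one reason the choice of allocation interacts with the scheduling algorithm.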
Main file
moldable_cluster_hal.pdf (791.38 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03028773, version 1 (30-11-2020)

Identifiers

  • HAL Id: hal-03028773, version 1

Cite

Anne Benoit, Valentin Le Fèvre, Lucas Perotin, Padma Raghavan, Yves Robert, et al. Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms. CLUSTER 2020 - IEEE International Conference on Cluster Computing, Sep 2020, Kobe, Japan. pp.1-29. ⟨hal-03028773⟩