Conference paper, Year: 2020

Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms

Abstract

This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. For a moldable job, the processor allocation is chosen before execution, and the execution time depends on that allocation according to a speedup model. The objective is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios and hence must be re-executed each time they fail, until successful completion. This work generalizes the classical framework where jobs are known offline and do not fail. We introduce a list-based algorithm and prove new approximation ratios for three prominent speedup models (roofline, communication, Amdahl). We also introduce a batch-based algorithm, where each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. We conduct an extensive set of simulations to evaluate and compare different variants of the two algorithms. The results show that they consistently outperform baseline heuristics. In particular, the list algorithm performs better for the roofline and communication models, while the batch algorithm performs better for Amdahl's model. Overall, our best algorithm is within a factor of 1.47 of a lower bound on average over the whole set of experiments, and within a factor of 1.8 in the worst case.
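To make the three named speedup models concrete, here is a minimal Python sketch. The abstract does not state the formulas, so the forms below are the ones commonly used for these models and are an assumption here, as are all function and parameter names (`w` for total work, `p` for allocated processors, `p_max` for the maximum useful parallelism, `c` for a per-processor communication cost, `gamma` for the sequential fraction). The helper `time_with_failures` further assumes that every failed attempt runs for the full execution time before the job is re-executed.

```python
def roofline_time(w, p, p_max):
    """Assumed roofline model: linear speedup up to p_max processors, flat beyond."""
    return w / min(p, p_max)


def communication_time(w, p, c):
    """Assumed communication model: perfectly parallel work plus an overhead
    that grows with the number of processors."""
    return w / p + c * (p - 1)


def amdahl_time(w, p, gamma):
    """Assumed Amdahl model: a fraction gamma of the work is inherently sequential."""
    return w * ((1 - gamma) / p + gamma)


def time_with_failures(exec_time, failures):
    """Total time of a job that fails `failures` times and then succeeds,
    assuming each attempt costs the full execution time (illustrative only)."""
    return (failures + 1) * exec_time


if __name__ == "__main__":
    w = 100.0
    print(roofline_time(w, p=8, p_max=4))
    print(communication_time(w, p=4, c=1.0))
    print(amdahl_time(w, p=4, gamma=0.25))
    print(time_with_failures(amdahl_time(w, p=4, gamma=0.25), failures=2))
```

Under these assumed forms, adding processors beyond `p_max` gains nothing in the roofline model, increases the overhead term in the communication model, and cannot push the Amdahl execution time below `gamma * w`. This is why the processor allocation matters for the makespan.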
Main file
moldable_cluster_hal.pdf (791.38 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03028773, version 1 (30-11-2020)

Identifiers

  • HAL Id: hal-03028773, version 1

Cite

Anne Benoit, Valentin Le Fèvre, Lucas Perotin, Padma Raghavan, Yves Robert, et al. Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms. CLUSTER 2020 - IEEE International Conference on Cluster Computing, Sep 2020, Kobe, Japan. pp. 1-29. ⟨hal-03028773⟩