Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs

This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem while assuming that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework, where jobs are known offline and do not fail. In that framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when the schedule is restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. The paper focuses on the design of several heuristics, some list-based and some shelf-based, along with different priority rules and backfilling strategies. We assess and compare their performance through an extensive set of simulations, using both synthetic jobs and log traces from the Mira supercomputer.
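To make the abstract's setting concrete, the following is a minimal sketch of unrestricted list scheduling with longest-job-first priority, in which a failed job is re-executed. It is not the paper's implementation. The names `Job` and `list_schedule` are hypothetical, and the failure model is an assumption made for illustration: each job carries a fixed number of failed attempts, an error is detected only at the end of an attempt, and a failed job re-enters the priority list with its full duration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str          # must be unique
    procs: int         # processors required (rigid job)
    time: float        # duration of one execution attempt
    failures: int = 0  # assumption: failed attempts before the job succeeds


def list_schedule(jobs, total_procs):
    """Greedy list scheduling, longest job first, with re-execution on failure.

    Returns (makespan, log); log holds (start, end, name, attempt) tuples.
    """
    for job in jobs:
        if job.procs > total_procs:
            raise ValueError(f"{job.name} needs more than {total_procs} processors")

    def priority(job):
        # Longest attempt first; ties broken by name for determinism.
        return (-job.time, job.name)

    by_name = {job.name: job for job in jobs}
    failures_left = {job.name: job.failures for job in jobs}
    attempts = {job.name: 0 for job in jobs}
    ready = sorted(jobs, key=priority)
    running = []  # min-heap of (end_time, name)
    free = total_procs
    now = 0.0
    log = []

    while ready or running:
        # Scan the list in priority order and start every job that fits.
        waiting = []
        for job in ready:
            if job.procs <= free:
                free -= job.procs
                attempts[job.name] += 1
                end = now + job.time
                heapq.heappush(running, (end, job.name))
                log.append((now, end, job.name, attempts[job.name]))
            else:
                waiting.append(job)
        ready = waiting

        # Advance time to the next completion and release its processors.
        now, name = heapq.heappop(running)
        done = by_name[name]
        free += done.procs
        if failures_left[name] > 0:
            # Assumed model: the error is detected when the attempt ends,
            # so the job goes back into the list for a full re-execution.
            failures_left[name] -= 1
            ready.append(done)
            ready.sort(key=priority)

    return now, log


# Illustrative instance: 4 processors, job A fails once before succeeding.
example_jobs = [Job("A", 2, 3, failures=1), Job("B", 2, 2), Job("C", 4, 1)]
makespan, schedule_log = list_schedule(example_jobs, total_procs=4)
print(makespan)  # 7.0 (it would be 4.0 if A never failed)
```

In this instance the single failure of A delays the full-width job C from time 3 to time 6, which illustrates how re-executions can shift the whole schedule. The per-job failure counts are used only to drive the simulation; the priority rule itself never looks at them.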

Domains

Computer Science [cs]

Main file

rigid_apdcm_hal.pdf (747.78 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03029842

Submitted on: Monday, November 30, 2020, 20:58:31

Last modified on: Tuesday, February 6, 2024, 11:13:30

Long-term archiving on: Monday, March 1, 2021, 18:15:34

Dates and versions

hal-03029842, version 1 (30-11-2020)

Identifiers

HAL Id: hal-03029842, version 1

Cite

Anne Benoit, Valentin Le Fèvre, Padma Raghavan, Yves Robert, Hongyang Sun. Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs. APDCM 2020 - Workshop on Advances in Parallel and Distributed Computational Models (colocated with IPDPS), May 2020, New Orleans, LA, United States. pp.1-27. ⟨hal-03029842⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA2 UDL
