Computing the expected makespan of task graphs in the presence of silent errors

Abstract : Applications structured as Directed Acyclic Graphs (DAGs) of tasks correspond to a general model of parallel computation that occurs in many domains, including popular scientific workflows. DAG scheduling has received an enormous amount of attention, and several list-scheduling heuristics have been proposed and shown to be effective in practice. Many of these heuristics make scheduling decisions based on path lengths in the DAG. At large scale, however, compute platforms and thus tasks are subject to various types of failures with no longer negligible probabilities of occurrence. Failures that have recently received increasing attention are " silent errors, " which cause a task to produce incorrect results even though it ran to completion. Tolerating silent errors is done by checking the validity of the results and re-executing the task from scratch in case of an invalid result. The execution time of a task then becomes a random variable, and so are path lengths. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics is preconditioned on computing accurate approximations of the expected makespan. In this work we propose an algorithm that computes a first order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We compare our proposed approximation to previously proposed such approximations for three classes of application graphs from the field of numerical linear algebra. Our evaluations quantify approximation error with respect to a ground truth computed via a brute-force Monte Carlo method. We find that our proposed approximation outperforms previously proposed approaches, leading to large reductions in approximation error for low (and realistic) failure rates, while executing much faster.
Type de document :
Communication dans un congrès
Ninth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2016, Aug 2016, Philadelphia, United States. Ninth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2016
Liste complète des métadonnées

Littérature citée [40 références]  Voir  Masquer  Télécharger

https://hal.inria.fr/hal-01354711
Contributeur : Equipe Roma <>
Soumis le : vendredi 19 août 2016 - 11:50:22
Dernière modification le : vendredi 20 avril 2018 - 15:44:27
Document(s) archivé(s) le : dimanche 20 novembre 2016 - 10:14:54

Fichier

icpp_2016.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

  • HAL Id : hal-01354711, version 1

Collections

Citation

Henri Casanova, Julien Herrmann, Yves Robert. Computing the expected makespan of task graphs in the presence of silent errors. Ninth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2016, Aug 2016, Philadelphia, United States. Ninth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2016. 〈hal-01354711〉

Partager

Métriques

Consultations de la notice

268

Téléchargements de fichiers

121