Conference papers

Computing the expected makespan of task graphs in the presence of silent errors

Abstract : Applications structured as Directed Acyclic Graphs (DAGs) of tasks correspond to a general model of parallel computation that occurs in many domains, including popular scientific workflows. DAG scheduling has received an enormous amount of attention, and several list-scheduling heuristics have been proposed and shown to be effective in practice. Many of these heuristics make scheduling decisions based on path lengths in the DAG. At large scale, however, compute platforms, and thus tasks, are subject to various types of failures whose probabilities of occurrence are no longer negligible. Failures that have recently received increasing attention are "silent errors," which cause a task to produce incorrect results even though it ran to completion. Silent errors are tolerated by checking the validity of the results and re-executing the task from scratch when a result is invalid. The execution time of a task then becomes a random variable, and so do path lengths. Unfortunately, computing the expected makespan of a DAG (and, equivalently, computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics is preconditioned on computing accurate approximations of the expected makespan. In this work we propose an algorithm that computes a first-order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We compare our approximation to previously proposed approximations for three classes of application graphs from the field of numerical linear algebra. Our evaluations quantify approximation error with respect to a ground truth computed via a brute-force Monte Carlo method. We find that our approximation outperforms previously proposed approaches, leading to large reductions in approximation error for low (and realistic) failure rates, while executing much faster.
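The brute-force Monte Carlo baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on simplifying assumptions that are not stated on this page: each execution attempt of a task fails independently with a fixed per-task probability, verification takes no time, a failed task is re-executed from scratch until it succeeds, and enough processors are available that the makespan equals the longest path in the DAG. The names `sample_makespan`, `estimate_expected_makespan`, `durations`, `preds`, and `fail_prob` are hypothetical.

```python
import random


def sample_makespan(durations, preds, fail_prob, rng):
    """Draw one random makespan sample.

    durations: dict task -> error-free execution time, listed in
               topological order (assumption: dict insertion order).
    preds:     dict task -> iterable of predecessor tasks.
    fail_prob: dict task -> probability that one attempt is hit by a
               silent error (assumed independent across attempts).
    """
    finish = {}
    for task, duration in durations.items():
        # Re-execute from scratch until an attempt passes verification.
        attempts = 1
        while rng.random() < fail_prob[task]:
            attempts += 1
        # A task starts once all of its predecessors have finished.
        start = max((finish[p] for p in preds.get(task, ())), default=0.0)
        finish[task] = start + attempts * duration
    return max(finish.values())


def estimate_expected_makespan(durations, preds, fail_prob,
                               samples=10000, seed=0):
    """Average many independent samples to estimate the expected makespan."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += sample_makespan(durations, preds, fail_prob, rng)
    return total / samples


if __name__ == "__main__":
    # Small diamond-shaped DAG: a -> {b, c} -> d.
    durations = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
    preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    fail_prob = {t: 0.01 for t in durations}
    print(estimate_expected_makespan(durations, preds, fail_prob))
```

With all failure probabilities set to zero, the estimate reduces to the deterministic critical-path length. The cost of this approach grows with the number of samples needed for a tight estimate, which is the motivation for faster analytical approximations such as the one the abstract describes.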

Contributor : Equipe Roma
Submitted on : Friday, August 19, 2016 - 11:50:22 AM
Last modification on : Friday, September 30, 2022 - 4:12:13 AM
Long-term archiving on : Sunday, November 20, 2016 - 10:14:54 AM


Files produced by the author(s)


  • HAL Id : hal-01354711, version 1



Henri Casanova, Julien Herrmann, Yves Robert. Computing the expected makespan of task graphs in the presence of silent errors. Ninth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2016, Aug 2016, Philadelphia, United States. ⟨hal-01354711⟩


