HAL will be down for maintenance from Friday, June 10 at 4pm through Monday, June 13 at 9am. More information
Skip to Main content Skip to Navigation
Journal articles

Computing the expected makespan of task graphs in the presence of silent errors

Abstract : Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many domains, including popular scientific workflows. DAG scheduling has thus received an enormous amount of attention. Many of the popular DAG scheduling heuristics make scheduling decisions based on path lengths. At large scale compute platforms are subject to various types of failures with non-negligible probabilities of occurrence. Failures that have recently received increased attention are “silent errors,” which cause data corruption. Tolerating silent errors is done by checking the validity of computed results and possibly re-executing a task from scratch. The execution time of a task then becomes a random variable, and so do path lengths in a DAG. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics in this context is challenging. In this work, we propose an algorithm that computes a first order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We find that our proposed approximation outperforms previously proposed approaches both in terms of approximation error and of speed.
Document type :
Journal articles
Complete list of metadata

Cited literature [46 references]  Display  Hide  Download

https://hal.inria.fr/hal-01968433
Contributor : Equipe Roma Connect in order to contact the contributor
Submitted on : Thursday, January 24, 2019 - 6:18:39 PM
Last modification on : Monday, May 16, 2022 - 4:46:02 PM

File

parco_2016.pdf
Files produced by the author(s)

Identifiers

Collections

Citation

Henri Casanova, Julien Herrmann, Yves Robert. Computing the expected makespan of task graphs in the presence of silent errors. Parallel Computing, Elsevier, 2018, 75, pp.41-60. ⟨10.1016/j.parco.2018.03.004⟩. ⟨hal-01968433⟩

Share

Metrics

Record views

42

Files downloads

111