A Generic Approach to Scheduling and Checkpointing Workflows

Abstract : This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments show significant gains over both CkptAll and CkptNone, for a wide variety of workflows.
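
The trade-off between CkptAll and CkptNone described in the abstract can be illustrated with a toy calculation. The sketch below is not the report's algorithm. It assumes a linear chain of tasks on a single processor, exponentially distributed fail-stop errors, and a closed form commonly used in the checkpointing literature for the expected time to complete a segment of work followed by a checkpoint. All function names, task durations, and checkpoint and recovery costs are hypothetical values chosen for illustration.

```python
import math


def expected_segment_time(work, ckpt, recovery, rate, downtime=0.0):
    """Expected time to run `work` then a checkpoint of cost `ckpt`.

    Assumes fail-stop errors arrive as a Poisson process with the given
    `rate` (1 / MTBF). After each failure the processor pays `downtime`
    and `recovery` before re-executing the segment from its start.
    """
    if rate == 0.0:
        return work + ckpt
    return (math.exp(rate * recovery)
            * (1.0 / rate + downtime)
            * math.expm1(rate * (work + ckpt)))


def ckpt_none(task_times, rate):
    """No checkpoint: the whole chain is one segment restarted from scratch."""
    return expected_segment_time(sum(task_times), 0.0, 0.0, rate)


def ckpt_all(task_times, ckpt, recovery, rate):
    """Checkpoint after every task: each task is its own segment.

    Simplification: the recovery cost is also charged to the first task,
    which slightly overestimates the true expectation.
    """
    return sum(expected_segment_time(w, ckpt, recovery, rate)
               for w in task_times)


# Hypothetical chain: 50 tasks of 100 s, 10 s checkpoint, 10 s recovery.
tasks = [100.0] * 50
for mtbf in (1e6, 1e3):
    rate = 1.0 / mtbf
    none_time = ckpt_none(tasks, rate)
    all_time = ckpt_all(tasks, 10.0, 10.0, rate)
    print(f"MTBF={mtbf:.0e}s  CkptNone={none_time:.0f}s  CkptAll={all_time:.0f}s")
```

With these illustrative numbers, CkptNone has the lower expected time when failures are rare and CkptAll has the lower expected time when failures are frequent. That gap is what motivates looking for intermediate strategies that checkpoint only some tasks.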
Document type :
Reports

Cited literature : 37 references

https://hal.inria.fr/hal-01766352
Contributor : Equipe Roma
Submitted on : Friday, April 13, 2018 - 3:04:07 PM
Last modification on : Wednesday, September 16, 2020 - 10:42:49 AM

File

RR-9167.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01766352, version 1

Citation

Li Han, Valentin Fèvre, Louis-Claude Canon, Yves Robert, Frédéric Vivien. A Generic Approach to Scheduling and Checkpointing Workflows. [Research Report] RR-9167, Inria. 2018, pp.1-29. ⟨hal-01766352v1⟩

Metrics

Record views : 133
File downloads : 50