Journal articles

A Generic Approach to Scheduling and Checkpointing Workflows

Abstract: This work deals with scheduling and checkpointing strategies for executing scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MINMIN and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CKPTALL), which is overkill when failures are rare, and checkpointing no task (CKPTNONE), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CKPTALL and CKPTNONE for a wide variety of workflows.
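The trade-off between CKPTALL and CKPTNONE can be illustrated with a deliberately simplified sketch. It is not the method of the article: it assumes a linear chain of identical tasks on a single processor, exponentially distributed fail-stop failures of rate `lam`, and the classical closed-form expected time for a segment of work followed by a checkpoint. All function and parameter names (`expected_time`, `ckpt_all`, `ckpt_none`, `work`, `ckpt`, `recovery`, `downtime`) are hypothetical and chosen for illustration only.

```python
import math


def expected_time(work, ckpt, recovery, lam, downtime=0.0):
    """Expected time to complete `work` followed by a checkpoint of cost `ckpt`.

    Assumes exponential fail-stop failures of rate `lam`; after each failure
    the processor pays `downtime`, then `recovery`, then re-executes the
    segment. Classical first-order model, used here as an assumption.
    """
    return (math.exp(lam * recovery)
            * (1.0 / lam + downtime)
            * math.expm1(lam * (work + ckpt)))


def ckpt_all(n, w, c, r, lam):
    """Chain of n tasks of weight w, each followed by a checkpoint of cost c."""
    return n * expected_time(w, c, r, lam)


def ckpt_none(n, w, lam):
    """Same chain with no checkpoint: any failure restarts the whole chain."""
    return expected_time(n * w, 0.0, 0.0, lam)


if __name__ == "__main__":
    n, w, c, r = 100, 100.0, 10.0, 10.0  # illustrative values only
    for lam in (1e-6, 1e-3):
        print(f"lam={lam:g}  CKPTALL={ckpt_all(n, w, c, r, lam):.1f}  "
              f"CKPTNONE={ckpt_none(n, w, lam):.1f}")
```

Under these assumptions, CKPTNONE is cheaper when failures are rare (the checkpoint cost dominates), while CKPTALL is far cheaper when failures are frequent (the re-execution cost dominates), which is the tension the abstract describes.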

Cited literature: 46 references
Contributor: Equipe Roma
Submitted on: Monday, May 27, 2019 - 10:57:04 AM
Last modification on: Thursday, September 29, 2022 - 2:58:07 PM





Li Han, Valentin Le Fèvre, Louis-Claude Canon, Yves Robert, Frédéric Vivien. A Generic Approach to Scheduling and Checkpointing Workflows. International Journal of High Performance Computing Applications, SAGE Publications, 2019, pp.1-19. ⟨10.1177/1094342019866891⟩. ⟨hal-02140295⟩


