Journal articles

Multi-level checkpointing and silent error detection for linear workflows

Anne Benoit (1), Aurélien Cavelan (1), Yves Robert (1), Hongyang Sun (1)
(1) ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.
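To make the baseline concrete, the following is a minimal sketch of single-level checkpoint placement on a linear chain by dynamic programming. It is not the authors' multi-level or silent-error algorithm. It assumes exponentially distributed fail-stop errors with rate `rate`, the same checkpoint cost `ckpt` and recovery cost `recovery` for every task (including restart from the beginning of the chain), a fixed `downtime`, and a mandatory checkpoint after the last task. It uses the well-known closed form for the expected time of one checkpointed segment. The names `expected_segment_time` and `place_checkpoints` are hypothetical and chosen for illustration.

```python
import math


def expected_segment_time(work, ckpt, recovery, downtime, rate):
    """Expected time to run `work` and then one checkpoint of cost `ckpt`,
    restarting from the previous checkpoint after each fail-stop error.
    Assumes exponential failures: e^{rate*R} * (1/rate + D) * (e^{rate*(W+C)} - 1).
    """
    return (math.exp(rate * recovery)
            * (1.0 / rate + downtime)
            * math.expm1(rate * (work + ckpt)))


def place_checkpoints(weights, ckpt, recovery, downtime, rate):
    """Return (expected makespan, 1-based indices of tasks followed by a checkpoint).

    best[j] is the minimal expected time to finish tasks 1..j with a
    checkpoint taken right after task j; the recurrence tries every
    possible position i < j for the previous checkpoint, in O(n^2) time.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            segment = prefix[j] - prefix[i]  # total work of tasks i+1..j
            cand = best[i] + expected_segment_time(
                segment, ckpt, recovery, downtime, rate)
            if cand < best[j]:
                best[j] = cand
                prev[j] = i

    # Walk back through the recorded predecessors to list checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

Calling `place_checkpoints([100.0] * 4, ckpt=1.0, recovery=1.0, downtime=0.0, rate=0.01)` returns the expected makespan together with the list of tasks after which a checkpoint is taken. With a high error rate the sketch checkpoints often, and with a very low rate it keeps only the final checkpoint.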

https://hal.inria.fr/hal-02082408
Contributor: Equipe Roma
Submitted on: Thursday, March 28, 2019 - 2:04:37 PM
Last modification on: Wednesday, February 26, 2020 - 11:14:08 AM
Long-term archiving on: Saturday, June 29, 2019 - 2:39:08 PM

File

jocs-revised.pdf
Files produced by the author(s)

Citation

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun. Multi-level checkpointing and silent error detection for linear workflows. Journal of Computational Science, Elsevier, 2018, 28, pp.398-415. ⟨10.1016/j.jocs.2017.03.024⟩. ⟨hal-02082408⟩
