Conference papers

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows

Abstract: This paper combines checkpointing and replication for the reliable execution of linear workflows. While both methods have been studied separately, their combination has not yet been investigated, despite its promising potential to minimize the execution time of linear workflows in failure-prone environments. The combination raises new problems: for each task, we have to decide whether to checkpoint it, replicate it, or both. We provide an optimal dynamic programming algorithm of quadratic complexity that solves both decision problems. This algorithm has been validated through extensive simulations, which reveal the conditions under which checkpointing only, replication only, or the combination of both techniques leads to improved performance.
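
To give a feel for what a quadratic dynamic program over a linear chain of tasks looks like, here is a minimal sketch in Python. It is not the algorithm from the paper: it covers only the checkpointing decision (replication is left out entirely), and the cost model is an assumption made for illustration, namely fail-stop failures with exponentially distributed inter-arrival times, a fixed checkpoint cost, a fixed recovery cost, no downtime, and a mandatory checkpoint after the last task. The names `segment_time`, `place_checkpoints`, `ckpt`, `rec` and `lam` are hypothetical.

```python
import math


def segment_time(work, ckpt, rec, lam):
    """Expected time to run `work` followed by a checkpoint of cost `ckpt`.

    Assumed model (not taken from the paper): failures arrive at rate `lam`,
    each failure costs a recovery of `rec`, and there is no downtime.
    """
    return math.exp(lam * rec) * math.expm1(lam * (work + ckpt)) / lam


def place_checkpoints(weights, ckpt, rec, lam):
    """Return (expected makespan, indices of tasks followed by a checkpoint).

    best[j] is the minimal expected time to finish tasks 1..j with a
    checkpoint taken right after task j. Each of the n states scans at most
    n predecessors, hence O(n^2) evaluations.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # The last checkpoint before task j sits after task i
            # (i == 0 means the segment restarts from the initial state).
            cost = best[i] + segment_time(prefix[j] - prefix[i], ckpt, rec, lam)
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk the predecessor links back to recover the checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    return best[n], positions[::-1]
```

Under this assumed model, a very low failure rate yields a single checkpoint at the end of the chain, while a high failure rate pushes the solution toward a checkpoint after every task. Extending the state with a per-task replication choice, as the paper does, requires a richer cost model than the one sketched here.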

Cited literature: 43 references

https://hal.inria.fr/hal-01963655
Contributor: Equipe Roma
Submitted on: Friday, December 21, 2018 - 2:39:18 PM
Last modification on: Wednesday, February 26, 2020 - 11:14:28 AM
Long-term archiving on: Friday, March 22, 2019 - 4:57:41 PM

File

apdcm-camready.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01963655, version 1

Citation

Anne Benoit, Aurélien Cavelan, Florina Ciorba, Valentin Le Fèvre, Yves Robert. Combining Checkpointing and Replication for Reliable Execution of Linear Workflows. APDCM 2018 - 20th Workshop on Advances in Parallel and Distributed Computational Models workshop, in conjunction with IPDPS'18, May 2018, Vancouver, Canada. pp.1-10. ⟨hal-01963655⟩
