Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors

Abstract : Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions under which checkpointing only, replication only, or the combination of both techniques leads to improved performance.

https://hal.inria.fr/hal-01955859
Contributor : Equipe Roma
Submitted on : Friday, December 14, 2018 - 4:24:35 PM
Last modification on : Monday, November 16, 2020 - 9:58:14 AM
Long-term archiving on : Friday, March 15, 2019 - 4:57:23 PM

File

RR-9235.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01955859, version 1

Citation

Anne Benoit, Aurélien Cavelan, Florina Ciorba, Valentin Le Fèvre, Yves Robert. Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors. [Research Report] RR-9235, ROMA (INRIA Rhône-Alpes / LIP Laboratoire de l’Informatique du Parallélisme); LIP - Laboratoire de l’Informatique du Parallélisme. 2018, pp.1-32. ⟨hal-01955859⟩
