Conference papers

Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms

Abstract : In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try and execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
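The periodic strategy described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulator, and it makes several assumptions that the abstract does not state: failures follow independent exponential distributions, a failed platform pays a fixed recovery cost and restarts the pattern from the last checkpoint, recovery itself is failure-free, and the cost for the slower platform to synchronize from the new checkpoint is ignored. All function and parameter names (`completion_time`, `pattern_time`, `mean_overhead`, `ckpt_cost`, `recovery_cost`) are hypothetical.

```python
import math
import random


def completion_time(work, speed, fail_rate, ckpt_cost, recovery_cost, rng):
    """Time for ONE platform to finish `work` units plus a checkpoint.

    Assumption: exponential failures with rate `fail_rate`; after a failure
    the platform pays `recovery_cost` and restarts the whole pattern.
    """
    needed = work / speed + ckpt_cost
    elapsed = 0.0
    while True:
        # A zero failure rate means the platform never fails.
        next_failure = rng.expovariate(fail_rate) if fail_rate > 0 else math.inf
        if next_failure >= needed:
            return elapsed + needed
        elapsed += next_failure + recovery_cost


def pattern_time(work, speeds, fail_rates, ckpt_cost, recovery_cost, rng):
    """The pattern ends when the FIRST platform completes and checkpoints."""
    return min(
        completion_time(work, s, lam, ckpt_cost, recovery_cost, rng)
        for s, lam in zip(speeds, fail_rates)
    )


def mean_overhead(work, speeds, fail_rates, ckpt_cost, recovery_cost,
                  trials=2000, seed=0):
    """Average pattern time relative to failure-free time on the faster platform."""
    rng = random.Random(seed)
    total = sum(
        pattern_time(work, speeds, fail_rates, ckpt_cost, recovery_cost, rng)
        for _ in range(trials)
    )
    failure_free = work / max(speeds)
    return total / trials / failure_free - 1.0


if __name__ == "__main__":
    # Illustrative values only: scan a few pattern sizes W and report overhead.
    for w in (50.0, 100.0, 200.0, 400.0, 800.0):
        h = mean_overhead(w, speeds=(2.0, 1.0), fail_rates=(0.002, 0.001),
                          ckpt_cost=5.0, recovery_cost=5.0)
        print(f"W = {w:6.0f}  estimated overhead = {h:.4f}")
```

Scanning W in this way gives an empirical counterpart to the optimal pattern size: small patterns pay the checkpoint cost too often, while large patterns lose more work to each failure. The parameter values in the usage loop are placeholders, not values taken from the paper.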

https://hal.inria.fr/hal-02082847
Contributor : Equipe Roma
Submitted on : Thursday, March 28, 2019 - 2:59:01 PM
Last modification on : Monday, November 16, 2020 - 9:58:10 AM
Long-term archiving on : Saturday, June 29, 2019 - 2:31:29 PM

File

replication.pdf
Files produced by the author(s)

Citation

Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert. Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms. 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale FTXS, Jun 2017, Washington, United States. pp.9-16, ⟨10.1145/3086157.3086165⟩. ⟨hal-02082847⟩
