Conference papers

Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms

Abstract: In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try to execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
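
As a rough illustration of the periodic strategy described in the abstract, the Python sketch below estimates the overhead of one pattern of W work units by Monte Carlo simulation. It is not the authors' simulator or their analytical model. It assumes exponentially distributed failures, a fixed recovery cost after each failure, no failures during the checkpoint itself, and a free synchronization of the slower platform. All function and parameter names (`solo_completion_time`, `pattern_time`, `mean_overhead`, `checkpoint`, `recovery`) are hypothetical.

```python
import random


def solo_completion_time(duration, failure_rate, recovery, rng):
    """Time for one platform to complete `duration` seconds of failure-free
    work, restarting the pattern from the last checkpoint after each failure.
    Assumption: failures are exponentially distributed with `failure_rate`."""
    elapsed = 0.0
    while True:
        # Time until the next failure; infinite if the platform never fails.
        if failure_rate > 0:
            to_failure = rng.expovariate(failure_rate)
        else:
            to_failure = float("inf")
        if to_failure >= duration:
            return elapsed + duration
        # Work since the last checkpoint is lost; pay the recovery cost.
        elapsed += to_failure + recovery


def pattern_time(W, speeds, failure_rates, checkpoint, recovery, rng):
    """Duration of one pattern: both platforms try to execute W units of work,
    and the first one to finish takes the checkpoint.
    Assumption: the checkpoint itself is failure-free."""
    finish = [
        solo_completion_time(W / speed, rate, recovery, rng)
        for speed, rate in zip(speeds, failure_rates)
    ]
    return min(finish) + checkpoint


def mean_overhead(W, speeds, failure_rates, checkpoint, recovery,
                  trials=10000, seed=0):
    """Average overhead relative to a failure-free, checkpoint-free run
    on the fastest platform."""
    rng = random.Random(seed)
    total = sum(
        pattern_time(W, speeds, failure_rates, checkpoint, recovery, rng)
        for _ in range(trials)
    )
    ideal = W / max(speeds)
    return total / (trials * ideal) - 1.0
```

Under these assumptions, calling `mean_overhead` over a range of W values and keeping the smallest result gives an empirical estimate of a good pattern size for a given pair of speeds and failure rates.
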

Cited literature: 20 references
Contributor: Equipe Roma
Submitted on: Thursday, March 28, 2019 - 2:59:01 PM
Last modification on: Monday, November 16, 2020 - 9:58:10 AM
Long-term archiving on: Saturday, June 29, 2019 - 2:31:29 PM


Files produced by the author(s)




Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert. Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms. 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale FTXS, Jun 2017, Washington, United States. pp.9-16, ⟨10.1145/3086157.3086165⟩. ⟨hal-02082847⟩


