Optimal checkpointing period with replicated execution on heterogeneous platforms

Abstract : In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size $W$ for a periodic checkpointing strategy where both platforms concurrently try to execute $W$ units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by $30\%$. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
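The periodic strategy described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the model or the simulator of the report: it is a minimal illustration under assumptions that are ours, namely exponentially distributed failures, no failures during checkpoint or recovery, a free synchronization for the platform that loses the race, and an overhead measured against the failure-free time on the faster platform. All function and parameter names (`time_to_finish`, `pattern_time`, `overhead`, `checkpoint`, `recovery`) are hypothetical.

```python
import random


def time_to_finish(work, speed, fail_rate, recovery, rng):
    """Time for one platform to execute `work` units without interruption.

    Assumption: failures are exponential with rate `fail_rate`; after each
    failure the platform pays `recovery` and restarts the pattern from the
    last checkpoint.
    """
    need = work / speed  # failure-free duration of the pattern
    elapsed = 0.0
    while True:
        to_failure = rng.expovariate(fail_rate)
        if to_failure >= need:
            return elapsed + need
        elapsed += to_failure + recovery


def pattern_time(work, speeds, fail_rates, checkpoint, recovery, rng):
    """Duration of one pattern: the first platform to finish checkpoints.

    Assumption: the other platform synchronizes from that checkpoint at no
    extra cost, so both start the next pattern together.
    """
    first = min(
        time_to_finish(work, s, lam, recovery, rng)
        for s, lam in zip(speeds, fail_rates)
    )
    return first + checkpoint


def overhead(work, speeds, fail_rates, checkpoint, recovery,
             patterns=2000, seed=0):
    """Average overhead relative to failure-free time on the faster platform."""
    rng = random.Random(seed)
    total = sum(
        pattern_time(work, speeds, fail_rates, checkpoint, recovery, rng)
        for _ in range(patterns)
    )
    ideal = patterns * work / max(speeds)
    return total / ideal - 1.0


if __name__ == "__main__":
    # Illustrative values only; scan a few pattern sizes W.
    for w in (500.0, 1000.0, 2000.0, 4000.0):
        h = overhead(w, speeds=(1.0, 0.5), fail_rates=(1e-3, 1e-3),
                     checkpoint=20.0, recovery=20.0)
        print(f"W={w:7.0f}  overhead={h:.3f}")
```

Scanning several values of $W$ in this way gives an empirical picture of the trade-off that the optimal pattern size resolves: small patterns pay the checkpoint cost too often, while large patterns lose more work to failures.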
Document type :
Reports

https://hal.inria.fr/hal-01504936
Contributor : Equipe Roma
Submitted on : Monday, April 10, 2017 - 4:33:44 PM
Last modification on : Monday, May 16, 2022 - 4:46:02 PM
Long-term archiving on : Tuesday, July 11, 2017 - 2:28:23 PM

File

rr9055inria.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01504936, version 1

Citation

Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert. Optimal checkpointing period with replicated execution on heterogeneous platforms. [Research Report] RR-9055, INRIA. 2017. ⟨hal-01504936⟩
