
# Optimal checkpointing period with replicated execution on heterogeneous platforms

ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size $W$ for a periodic checkpointing strategy where both platforms concurrently attempt to execute $W$ units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only when the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show that a secondary platform helps reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by $30\%$. The simulations also demonstrate that the periodic checkpointing strategy is more efficient overall, unless platform speeds are quite close.
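The periodic strategy described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the model or the formulas of the report: it assumes exponentially distributed failures, no failures during checkpoint or recovery, a free synchronization of the slower platform, and a restart of the whole pattern after each failure. All function and parameter names (`time_to_finish`, `pattern_time`, `time_per_unit_work`, `checkpoint`, `recovery`) are hypothetical.

```python
import math
import random


def time_to_finish(work, speed, rate, recovery, rng):
    """Time for one platform to complete `work` units in a single failure-free run.

    Assumption: after a failure the platform pays `recovery` and restarts
    the whole pattern from the last checkpoint.
    """
    duration = work / speed
    elapsed = 0.0
    while True:
        # Time until the next failure; a rate of 0 means the platform never fails.
        next_failure = rng.expovariate(rate) if rate > 0 else math.inf
        if next_failure >= duration:
            return elapsed + duration
        elapsed += next_failure + recovery


def pattern_time(work, speeds, rates, checkpoint, recovery, rng):
    """Duration of one pattern: the first platform to finish takes the checkpoint."""
    first = min(
        time_to_finish(work, speed, rate, recovery, rng)
        for speed, rate in zip(speeds, rates)
    )
    return first + checkpoint


def time_per_unit_work(work, speeds, rates, checkpoint, recovery,
                       trials=2000, seed=0):
    """Average time spent per unit of work over `trials` simulated patterns."""
    rng = random.Random(seed)
    total = sum(
        pattern_time(work, speeds, rates, checkpoint, recovery, rng)
        for _ in range(trials)
    )
    return total / (trials * work)


if __name__ == "__main__":
    # Illustrative values only: scan a few pattern sizes and keep the cheapest.
    candidates = [50.0, 100.0, 200.0, 400.0, 800.0]
    best = min(
        candidates,
        key=lambda w: time_per_unit_work(w, [1.0, 0.7], [1e-3, 1e-3], 10.0, 10.0),
    )
    print("best pattern size among candidates:", best)
```

Scanning a grid of candidate sizes only approximates the optimal $W$; the report derives it analytically from first- or second-order approximations.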
Document type : Reports

https://hal.inria.fr/hal-01504936
Contributor : Equipe Roma
Submitted on : Monday, April 10, 2017 - 4:33:44 PM
Last modification on : Monday, May 16, 2022 - 4:46:02 PM
Long-term archiving on : Tuesday, July 11, 2017 - 2:28:23 PM

### File

rr9055inria.pdf
Files produced by the author(s)

### Identifiers

• HAL Id : hal-01504936, version 1

### Citation

Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert. Optimal checkpointing period with replicated execution on heterogeneous platforms. [Research Report] RR-9055, INRIA. 2017. ⟨hal-01504936⟩
