
Using group replication for resilience on exascale systems

Marin Bougeret 1 Henri Casanova 2 Yves Robert 3, * Frédéric Vivien 3, * Dounia Zaidouni 3, *
* Corresponding author
1 MAORE - Méthodes Algorithmes pour l'Ordonnancement et les Réseaux
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
3 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: High-performance computing applications must be resilient to faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even with an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. With replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using checkpoint-recovery alone at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming-based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.
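
The abstract's scaling argument (checkpointing frequency, and hence overhead, grows with platform size) can be illustrated with a rough sketch. The code below is not taken from the report. It uses Young's classical first-order approximation for the checkpointing period, W ≈ sqrt(2·C·μ), and assumes independent, exponentially distributed processor failures, so that the platform MTBF μ is the individual MTBF divided by the number of processors. All function names and numeric values are illustrative assumptions, and the approximation is only meaningful while the checkpoint cost C is small compared with μ.

```python
import math


def platform_mtbf(mtbf_individual_s: float, num_processors: int) -> float:
    """Platform MTBF, assuming independent exponential failures per processor."""
    return mtbf_individual_s / num_processors


def young_period(checkpoint_cost_s: float, mtbf_platform_s: float) -> float:
    """Young's first-order approximation of the work between two checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_platform_s)


def checkpoint_overhead(checkpoint_cost_s: float, period_s: float) -> float:
    """Fraction of failure-free time spent writing checkpoints.

    This deliberately ignores work lost and re-executed after a failure,
    so it underestimates the true waste.
    """
    return checkpoint_cost_s / (period_s + checkpoint_cost_s)


if __name__ == "__main__":
    # Hypothetical values: 10-year individual MTBF, 60-second checkpoint.
    mtbf_ind = 10 * 365 * 24 * 3600.0
    cost = 60.0
    for p in (2**10, 2**14, 2**18):
        mu = platform_mtbf(mtbf_ind, p)
        w = young_period(cost, mu)
        print(f"p={p:7d}  platform MTBF={mu:10.0f}s  "
              f"period={w:8.0f}s  overhead={checkpoint_overhead(cost, w):.3f}")
```

Under these assumptions the period shrinks as 1/sqrt(p) while the checkpoint cost stays fixed, so the overhead fraction rises with scale. This is the effect that motivates combining checkpoint-recovery with replication.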

Cited literature: 28 references

https://hal.inria.fr/hal-00668016
Contributor: Frédéric Vivien
Submitted on: Friday, June 28, 2013 - 10:55:22 PM
Last modification on: Thursday, June 11, 2020 - 8:56:02 PM
Long-term archiving on: Wednesday, April 5, 2017 - 4:59:39 AM

File

RR-7876_AugmentedVersion.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00668016, version 2

Citation

Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. Using group replication for resilience on exascale systems. [Research Report] RR-7876, INRIA. 2012. ⟨hal-00668016v2⟩
