Reports (Research report)

Using replication for resilience on exascale systems

Marin Bougeret 1 Henri Casanova 2 Yves Robert 3, 4 Frédéric Vivien 3 Dounia Zaidouni 3 
1 MAORE - Methods, Algorithms for Operations REsearch
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
3 GRAAL - Algorithms and Scheduling for Distributed Heterogeneous Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : High-performance computing applications must tolerate faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should checkpoints be saved? Unfortunately, even with an optimal checkpointing strategy, the frequency of checkpointing must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency on large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-rollback. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using checkpoint-rollback alone at large scale. In this work we investigate two approaches to replication. In the first approach, each process in a single instance of a parallel application is (transparently) replicated. In the second approach, entire application instances are replicated. We provide a theoretical study of these two approaches, comparing them to checkpoint-rollback alone in terms of expected application execution time.
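The scaling argument in the abstract can be illustrated with a minimal sketch. It is not taken from the report: it assumes Young's classical first-order approximation for the checkpointing period, sqrt(2 · C · MTBF), and assumes independent, exponentially distributed processor failures, so that the platform MTBF is the individual MTBF divided by the number of processors. All function and parameter names are hypothetical, and the approximation is only meaningful when the checkpoint cost is small compared to the platform MTBF.

```python
import math


def platform_mtbf(individual_mtbf_s: float, num_processors: int) -> float:
    """Platform MTBF in seconds, assuming i.i.d. exponential failures (assumption)."""
    return individual_mtbf_s / num_processors


def young_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpointing period (assumption)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost: checkpoint overhead plus expected rework."""
    period = young_period(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / period + period / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Illustrative values only: 10-year individual MTBF, 60-second checkpoint.
    individual_mtbf_s = 10 * 365 * 24 * 3600.0
    checkpoint_cost_s = 60.0
    for n in (1_000, 10_000, 100_000):
        mtbf = platform_mtbf(individual_mtbf_s, n)
        print(n, young_period(checkpoint_cost_s, mtbf),
              first_order_waste(checkpoint_cost_s, mtbf))
```

Under these assumptions the period shrinks and the estimated waste grows as the processor count increases, which is the trend the abstract describes for checkpoint-rollback used alone.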

Cited literature [20 references]
Contributor : Frédéric Vivien
Submitted on : Thursday, December 15, 2011 - 1:41:10 PM
Last modification on : Friday, November 18, 2022 - 9:26:44 AM
Long-term archiving on : Thursday, March 30, 2017 - 8:46:19 PM




  • HAL Id : hal-00650325, version 2


Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. Using replication for resilience on exascale systems. [Research Report] RR-7830, INRIA. 2011. ⟨hal-00650325v2⟩


