Using replication for resilience on exascale systems

Abstract : High-performance computing applications must tolerate faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should checkpoints be saved? Unfortunately, even with an optimal checkpointing strategy, the frequency of checkpointing must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency on large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-rollback. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using checkpoint-rollback alone at large scale. In this work we investigate two approaches to replication. In the first approach, each process in a single instance of a parallel application is (transparently) replicated. In the second approach, entire application instances are replicated. We provide a theoretical study of these two approaches, comparing them to checkpoint-rollback alone in terms of expected application execution time.
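The scaling argument in the abstract can be illustrated with a small sketch. It is not taken from the report: it assumes independent, exponentially distributed processor failures (so that the platform MTBF is the processor MTBF divided by the number of processors) and uses Young's classical first-order approximation of the checkpointing period, which loses accuracy when the checkpoint cost is not small relative to the platform MTBF. The function names and all numeric values (60 s checkpoint cost, 10-year processor MTBF) are hypothetical.

```python
import math


def platform_mtbf(proc_mtbf_s: float, n_procs: int) -> float:
    """Platform MTBF, assuming independent exponential failures (assumption)."""
    return proc_mtbf_s / n_procs


def young_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the work interval between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Failure-free fraction of time spent writing checkpoints: C / (T + C)."""
    period = young_period(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / (period + checkpoint_cost_s)


if __name__ == "__main__":
    C = 60.0                           # hypothetical checkpoint cost, in seconds
    PROC_MTBF = 10 * 365 * 24 * 3600   # hypothetical processor MTBF: 10 years, in seconds
    for n in (10**3, 10**4, 10**5, 10**6):
        m = platform_mtbf(PROC_MTBF, n)
        print(f"N={n:>8}  platform MTBF={m:12.1f} s  "
              f"period={young_period(C, m):10.1f} s  "
              f"overhead={checkpoint_overhead(C, m):.3f}")
```

Under these assumptions the period shrinks and the failure-free overhead grows as the processor count increases, which is the trend the abstract describes. The sketch ignores the time lost to failures and recovery, so it understates the total overhead.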
Contributor : Frédéric Vivien
Submitted on : Friday, December 9, 2011 - 5:27:40 PM
Last modification on : Tuesday, October 19, 2021 - 11:54:55 AM
Long-term archiving on : Sunday, December 4, 2016 - 10:50:54 PM


Files produced by the author(s)


  • HAL Id : hal-00650325, version 1


Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. Using replication for resilience on exascale systems. [Research Report] RR-7830, 2011. ⟨hal-00650325v1⟩