Using replication for resilience on exascale systems

Marin Bougeret [1], Henri Casanova [2], Yves Robert [3, 4], Frédéric Vivien [3], Dounia Zaidouni [3]
1 MAORE - Méthodes Algorithmes pour l'Ordonnancement et les Réseaux
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
3 GRAAL - Algorithms and Scheduling for Distributed Heterogeneous Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: High-performance computing applications must tolerate faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should checkpoints be saved? Unfortunately, even with an optimal checkpointing strategy, the frequency of checkpointing must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency on large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-rollback. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than checkpoint-rollback alone at large scale. In this work, we investigate two approaches to replication. In the first approach, each process in a single instance of a parallel application is (transparently) replicated. In the second approach, entire application instances are replicated. We provide a theoretical study of these two approaches, comparing them to checkpoint-rollback alone in terms of expected application execution time.
Document type:
Report
[Research Report] RR-7830, INRIA. 2011

Cited literature [20 references]

https://hal.inria.fr/hal-00650325
Contributor: Frédéric Vivien
Submitted on: Thursday, December 15, 2011 - 13:41:10
Last modified on: Wednesday, July 18, 2018 - 11:52:05
Document(s) archived on: Thursday, March 30, 2017 - 20:46:19

File

RR-7830.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00650325, version 2

Citation

Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. Using replication for resilience on exascale systems. [Research Report] RR-7830, INRIA. 2011. 〈hal-00650325v2〉
