Composing resilience techniques: ABFT, periodic and incremental checkpointing

Abstract: Algorithm-Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution where the data is protected by its own intrinsic properties and can therefore be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming algorithm to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
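To make concrete the idea that data can be "protected by its own intrinsic properties", here is a minimal sketch of the classic checksum scheme for matrix multiplication. It is not the model or protocol proposed in the paper; the helper names (`matmul`, `with_column_checksum`, `recover_entry`) and the 2x2 inputs are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: checksum-based ABFT for C = A * B.
# All function names and the sample matrices are hypothetical.

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def recover_entry(c_full, i, j):
    """Rebuild a lost entry (i, j) from the checksum row of c_full."""
    n = len(c_full) - 1  # index of the checksum row
    return c_full[n][j] - sum(c_full[r][j] for r in range(n) if r != i)

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Multiplying the checksum-extended A by B yields C plus its column checksums.
c_full = matmul(with_column_checksum(a), b)

# Simulate the loss of one entry, then recompute it without any checkpoint.
c_full[1][0] = None
recovered = recover_entry(c_full, 1, 0)
c_full[1][0] = recovered
```

The checksum row is maintained by the multiplication itself, so the lost entry is rebuilt from surviving data rather than restored from a saved copy. Sections of an application that lack such an invariant are the ones that still require checkpoint/restart.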
Document type:
Journal article
International Journal of Networking and Computing, Higashi Hiroshima : Dept. of Computer Engineering, Hiroshima University, 2015, 5 (1), pp.2-25

https://hal.inria.fr/hal-01091930
Contributor: Equipe Roma
Submitted on: Sunday, December 7, 2014 - 21:06:14
Last modified on: Friday, April 20, 2018 - 15:44:27

Identifiers

  • HAL Id: hal-01091930, version 1

Citation

George Bosilca, Aurélien Bouteiller, Thomas Hérault, Yves Robert, Jack Dongarra. Composing resilience techniques: ABFT, periodic and incremental checkpointing. International Journal of Networking and Computing, Higashi Hiroshima : Dept. of Computer Engineering, Hiroshima University, 2015, 5 (1), pp.2-25. 〈hal-01091930〉
