Assessing the Impact of ABFT and Checkpoint Composite Strategies

Abstract: Algorithm-Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. With advances in the theoretical and practical understanding of the algorithmic traits that enable such approaches, a growing number of frequently used algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution during which the data is protected by its own intrinsic properties and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach currently used for these applications is checkpoint/restart. In this paper, we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart to protect an iterative application composed of ABFT-aware and ABFT-unaware sections. We validate this model using a simulator. The model and simulator show that this composite approach drastically increases the performance delivered by an execution platform, especially at scale, by reducing the frequency of checkpoints while simultaneously decreasing the volume of data that must be checkpointed.
Document type:
Conference paper
16th Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2014), May 2014, Phoenix, United States. IEEE, pp.10, 2014, Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International 〈10.1109/IPDPSW.2014.79〉

https://hal.inria.fr/hal-01354689
Contributor: Equipe Roma
Submitted on: Friday, August 19, 2016 - 11:24:08
Last modified on: Friday, April 20, 2018 - 15:44:27
Document(s) archived on: Sunday, November 20, 2016 - 10:38:56

File

apdcm.pdf
Files produced by the author(s)

Citation

George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert, Jack Dongarra. Assessing the Impact of ABFT and Checkpoint Composite Strategies. 16th Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2014), May 2014, Phoenix, United States. IEEE, pp.10, 2014, Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International 〈10.1109/IPDPSW.2014.79〉. 〈hal-01354689〉
