Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors

Abstract: This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.
Document type:
Conference paper
IPDPS’2016, the 30th IEEE International Parallel and Distributed Processing Symposium, May 2016, Chicago, United States. IEEE Computer Society Press, 2016, Proceedings of IPDPS’2016, the 30th IEEE International Parallel and Distributed Processing Symposium. 〈10.1109/IPDPS.2016.39〉

https://hal.inria.fr/hal-01354886
Contributor: Equipe Roma
Submitted on: Friday, August 19, 2016 - 18:52:35
Last modified on: Tuesday, January 16, 2018 - 15:36:32
Document(s) archived on: Sunday, November 20, 2016 - 10:50:15

File

ipdps2016.pdf
Files produced by the author(s)

Citation

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun. Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors. IPDPS’2016, the 30th IEEE International Parallel and Distributed Processing Symposium, May 2016, Chicago, United States. IEEE Computer Society Press, 2016, Proceedings of IPDPS’2016, the 30th IEEE International Parallel and Distributed Processing Symposium. 〈10.1109/IPDPS.2016.39〉. 〈hal-01354886〉

Metrics

Record views

319

File downloads

45