Coping with silent errors in HPC applications

Abstract: This report describes a unified framework for the detection and correction of silent errors, which constitute a major threat to scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. We then introduce a general-purpose technique based on computational patterns that repeat periodically over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern that minimizes the expected execution time. Finally, we move to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.
Document type:
Report
[Research Report] RR-8825, CNRS, ENS Lyon & INRIA. 2015
Cited literature [41 references]

https://hal.inria.fr/hal-01242369
Contributor: Equipe Roma
Submitted on: Friday, December 11, 2015 - 22:55:40
Last modified on: Friday, April 20, 2018 - 15:44:27
Document(s) archived on: Saturday, April 29, 2017 - 12:12:02

File

RR-8825.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01242369, version 1

Citation

Guillaume Aupy, Anne Benoit, Massimiliano Fasi, Yves Robert, Hongyang Sun, et al. Coping with silent errors in HPC applications. [Research Report] RR-8825, CNRS, ENS Lyon & INRIA. 2015. 〈hal-01242369〉

Metrics

Record views

333

File downloads

120