Coping with silent errors in HPC applications - Inria - Institut national de recherche en sciences et technologies du numérique
Book chapter Year: 2017

Coping with silent errors in HPC applications

Massimiliano Fasi
  • Role: Author
  • PersonId : 1247865
  • IdHAL : mfasi

Abstract

This chapter describes a unified framework for the detection and correction of silent errors, which constitute a major threat to scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. We then introduce a general-purpose technique based on computational patterns that repeat periodically over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern that minimizes expected execution time. Finally, we move to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.
Main file
chapter.pdf (223.08 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01354892, version 1 (19-08-2016)

Cite

Guillaume Aupy, Anne Benoit, Aurélien Cavelan, Massimiliano Fasi, Yves Robert, et al. Coping with silent errors in HPC applications. Andy Adamatzky. Emergent Computation, Springer Verlag, 2017, 978-3-319-46375-9. ⟨10.1007/978-3-319-46376-6⟩. ⟨hal-01354892⟩
