Book sections

Coping with silent errors in HPC applications

Abstract : This chapter describes a unified framework for detecting and correcting silent errors, which constitute a major threat to scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. We then introduce a general-purpose technique based on computational patterns that repeat periodically over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern that minimizes expected execution time. Finally, we turn to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.

Cited literature [41 references]

Contributor : Equipe Roma
Submitted on : Friday, August 19, 2016 - 8:39:11 PM
Last modification on : Monday, May 16, 2022 - 4:46:02 PM
Long-term archiving on : Sunday, November 20, 2016 - 10:48:57 AM


Files produced by the author(s)




Guillaume Aupy, Anne Benoit, Aurélien Cavelan, Massimiliano Fasi, Yves Robert, et al. Coping with silent errors in HPC applications. Andy Adamatzky. Emergent Computation, Springer Verlag, 2017, 978-3-319-46375-9. ⟨10.1007/978-3-319-46376-6⟩. ⟨hal-01354892⟩


