Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants

Emmanuel Agullo (1), Luc Giraud (1), Emrullah Fatih Yetkin (1)
(1) HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract: Soft errors that escape hardware detection mechanisms can be extremely difficult to detect at the software layer. One option is to fully duplicate the computation (and its data) and to check regularly that intermediate results are consistent. However, the cost of this mechanism may be prohibitive. In the context of the conjugate gradient (CG) solver, the most expensive operation to duplicate is the sparse matrix-vector product (SpMV). To avoid duplicating this operation, checksum mechanisms may be employed instead. In this presentation, we investigate the reliability of such an approach in finite-precision arithmetic. We illustrate our discussion with the CGPOP code, a miniapp that performs the CG solve within the Parallel Ocean Program (POP), which is a candidate for exascale climate simulations.
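As an illustration of the checksum idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the common column-sum checksum: with c = 1^T A precomputed once, the sum of the entries of y = A x should equal c . x up to rounding error. The function names, the CSR layout, and the relative tolerance `rel_tol` are hypothetical choices made for this example; choosing a threshold that remains reliable in finite precision is precisely the question the talk investigates.

```python
import math


def csr_matvec(indptr, indices, data, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def column_sums(indices, data, ncols):
    """Precompute the checksum vector c = 1^T A (one entry per column)."""
    c = [0.0] * ncols
    for k in range(len(data)):
        c[indices[k]] += data[k]
    return c


def checksum_ok(c, x, y, rel_tol=1e-10):
    """Return True if sum(y) matches c . x within a rounding allowance.

    rel_tol is an illustrative assumption, not a value from the source.
    """
    lhs = math.fsum(y)
    rhs = math.fsum(ci * xi for ci, xi in zip(c, x))
    # Scale the threshold by the magnitude of the quantities being compared.
    scale = math.fsum(abs(v) for v in y)
    scale += math.fsum(abs(ci * xi) for ci, xi in zip(c, x))
    return abs(lhs - rhs) <= rel_tol * scale


# Small symmetric positive definite tridiagonal matrix in CSR form:
# [[ 4, -1,  0],
#  [-1,  4, -1],
#  [ 0, -1,  4]]
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0]

c = column_sums(indices, data, 3)
y = csr_matvec(indptr, indices, data, x)
clean = checksum_ok(c, x, y)

# Simulate a soft error by perturbing one entry of the product.
y_faulty = list(y)
y_faulty[1] += 1e-3
faulty = checksum_ok(c, x, y_faulty)
```

The check costs one dot product per SpMV instead of a second SpMV. A single checksum of this kind cannot flag perturbations smaller than the rounding allowance, nor errors that happen to cancel in the sum, so it trades some detection coverage for the lower cost.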
Document type:
Conference paper
SIAM Conference on Computational Science and Engineering (SIAM CSE 2015), Mar 2015, Salt Lake City, Utah, United States. 2015, 〈https://www.siam.org/meetings/cse15/〉

https://hal.inria.fr/hal-01200706
Contributor: Luc Giraud
Submitted on: Thursday, 17 September 2015 - 09:21:15
Last modified on: Thursday, 11 January 2018 - 06:22:35

Identifiers

  • HAL Id: hal-01200706, version 1

Citation

Emmanuel Agullo, Luc Giraud, Emrullah Fatih Yetkin. Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants. SIAM Conference on Computational Science and Engineering (SIAM CSE 2015), Mar 2015, Salt Lake City, Utah, United States. 2015, 〈https://www.siam.org/meetings/cse15/〉. 〈hal-01200706〉
