Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants

Emmanuel Agullo (1), Luc Giraud (1), Emrullah Fatih Yetkin (1)
(1) HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract : Soft errors that escape hardware detection mechanisms can be extremely difficult to detect at the software layer. One option is to fully duplicate the computation (and its data) and to check at regular intervals that intermediate results are consistent. However, the cost of this mechanism may be prohibitive. In a conjugate gradient (CG) solver, the most expensive operation to duplicate is the sparse matrix-vector product (SpMV). Checksum mechanisms can be used to avoid duplicating this operation. In this presentation, we investigate the reliability of such an approach in finite precision arithmetic. We illustrate our discussion with the CGPOP code, a miniapp for performing the CG within the Parallel Ocean Program (POP), which is a candidate for exascale climate simulations.
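The abstract does not spell out the checksum scheme, so the following is only a minimal sketch of one common variant, not the authors' implementation. It assumes a column-sum checksum c = A^T 1 computed once, so that the sum of the entries of y = A x should equal the dot product of c and x. In finite precision the two sides agree only up to rounding, hence a tolerance is needed; the tolerance formula, the `safety` factor, the test matrix and all function names below are illustrative assumptions.

```python
import math
import struct
import sys

def spmv(rows, x):
    """y = A x, with A stored as a list of rows of (column, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]

def column_checksum(rows, n):
    """c = A^T 1 (vector of column sums), computed once per matrix."""
    c = [0.0] * n
    for row in rows:
        for j, v in row:
            c[j] += v
    return c

def flip_bit(value, bit):
    """Flip one bit of an IEEE 754 double to emulate a soft error."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def checksum_ok(rows, c, x, y, safety=8.0):
    """Test whether 1^T y matches c^T x up to a rounding-aware tolerance.

    The tolerance is a heuristic assumption: it scales the machine
    epsilon by the problem size and by sum_ij |a_ij| |x_j|.
    """
    lhs = math.fsum(y)
    rhs = math.fsum(ci * xi for ci, xi in zip(c, x))
    scale = math.fsum(abs(v) * abs(x[j]) for row in rows for j, v in row)
    tol = safety * len(x) * sys.float_info.epsilon * scale
    return abs(lhs - rhs) <= tol

# Hypothetical test problem: 1D Laplacian (tridiagonal 2, -1) of size 50.
n = 50
rows = []
for i in range(n):
    row = [(i, 2.0)]
    if i > 0:
        row.append((i - 1, -1.0))
    if i < n - 1:
        row.append((i + 1, -1.0))
    rows.append(row)

x = [1.0 / (j + 1) for j in range(n)]
c = column_checksum(rows, n)
y = spmv(rows, x)

# Corrupt the first entry of y in two different ways.
y_exponent = list(y)
y_exponent[0] = flip_bit(y[0], 52)   # lowest exponent bit
y_mantissa = list(y)
y_mantissa[0] = flip_bit(y[0], 0)    # lowest mantissa bit

print("clean product passes:      ", checksum_ok(rows, c, x, y))
print("exponent-bit flip passes:  ", checksum_ok(rows, c, x, y_exponent))
print("mantissa-bit flip passes:  ", checksum_ok(rows, c, x, y_mantissa))
```

In this setup the flip in an exponent bit changes the checksum far beyond the tolerance, whereas the flip in the lowest mantissa bit is indistinguishable from rounding error and goes unnoticed. This is the kind of finite precision limitation on reliability that the abstract refers to; how tight the tolerance can safely be made is precisely the open question.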
https://hal.inria.fr/hal-01200706
Contributor : Luc Giraud
Submitted on : Thursday, September 17, 2015 - 9:21:15 AM
Last modification on : Thursday, May 9, 2019 - 11:58:11 AM

Identifiers

  • HAL Id : hal-01200706, version 1

Citation

Emmanuel Agullo, Luc Giraud, Emrullah Fatih Yetkin. Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants. SIAM Conference on Computational Science and Engineering (SIAM CSE 2015), Mar 2015, Salt Lake City, Utah, United States. ⟨hal-01200706⟩
