Conference papers

Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants

Emmanuel Agullo 1, Luc Giraud 1, Emrullah Fatih Yetkin 1
1 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract: Soft errors that are not detected by hardware mechanisms may be extremely difficult to detect at the software layer. One option is to fully duplicate the computation (and data) and check at regular intervals that intermediate results are consistent. However, this mechanism may be prohibitively expensive. In the context of a conjugate gradient (CG) solver, the most expensive operation to duplicate is the sparse matrix-vector product (SpMV). To avoid duplicating this operation, checksum mechanisms may be employed. In this presentation, we investigate the reliability of such an approach in finite-precision arithmetic. We illustrate our discussion with the CGPOP code, a miniapp that performs the CG solve within the Parallel Ocean Program (POP), which is a candidate for exascale climate simulations.
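As an illustration of the kind of checksum test the abstract refers to, here is a minimal sketch in plain Python. It is not taken from the presentation or from CGPOP: the column-sum checksum, the rounding-error tolerance, and the names `laplacian_1d`, `spmv`, `column_checksum`, and `checksum_ok` are all assumptions chosen for the example. The idea is that for y = A x, the sum of the entries of y should equal (1ᵀA) x up to rounding, so one extra dot product replaces a duplicated SpMV.

```python
import sys


def laplacian_1d(n):
    """Hypothetical test matrix: 1D Laplacian, rows stored as (col, value) pairs."""
    rows = []
    for i in range(n):
        row = []
        if i > 0:
            row.append((i - 1, -1.0))
        row.append((i, 2.0))
        if i < n - 1:
            row.append((i + 1, -1.0))
        rows.append(row)
    return rows


def spmv(rows, x):
    """Sparse matrix-vector product y = A x."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def column_checksum(rows, n_cols):
    """Checksum vector c = 1^T A (column sums), computed once per matrix."""
    c = [0.0] * n_cols
    for row in rows:
        for j, v in row:
            c[j] += v
    return c


def checksum_ok(rows, c, x, y, safety=10.0):
    """Return True if sum(y) matches c . x within an assumed rounding tolerance."""
    lhs = sum(y)
    rhs = sum(cj * xj for cj, xj in zip(c, x))
    # Assumed tolerance: a crude bound proportional to the number of
    # nonzeros, machine epsilon, and the magnitude sum |a_ij| * |x_j|.
    nnz = sum(len(row) for row in rows)
    scale = sum(abs(v) * abs(x[j]) for row in rows for j, v in row)
    tol = safety * nnz * sys.float_info.epsilon * scale
    return abs(lhs - rhs) <= tol
```

In finite precision the equality can only be tested up to a threshold, so a perturbation smaller than that threshold passes the check unnoticed, while a threshold that is too tight flags correct products. This sketch makes that trade-off visible through the `safety` parameter.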
Contributor: Luc Giraud
Submitted on: Thursday, September 17, 2015 - 9:21:15 AM
Last modification on: Saturday, June 25, 2022 - 7:46:19 PM


  • HAL Id : hal-01200706, version 1



Emmanuel Agullo, Luc Giraud, Emrullah Fatih Yetkin. Reliability of Checksum based Detection for Soft Errors in Conjugate Gradient Variants. SIAM Conference on Computational Science and Engineering (SIAM CSE 2015), Mar 2015, Salt Lake City, Utah, United States. ⟨hal-01200706⟩


