Conference papers

Hard faults and soft errors: possible numerical remedies in linear algebra solvers

Abstract : The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hard faults and soft errors. Fully handling these faults at the computer system level may have a prohibitive cost. High performance computing applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of faults. We focus on numerical linear algebra problems, such as the solution of linear systems or eigenproblems, which are the innermost numerical kernels in many scientific and engineering applications and also among their most time-consuming parts. To address hard faults on computing cores, we first present possible remedies based on recovery techniques followed by restarting strategies. In the framework of Krylov subspace linear solvers, the lost entries of the iterate are interpolated from the entries still available on the surviving cores to define a new initial guess before the Krylov method is restarted. In particular, we consider two interpolation policies that preserve key numerical properties of well-known linear solvers. Tackling silent data corruption (SDC) induced by soft errors is somewhat more complex, as we first need to better understand the impact of SDC on the numerical behavior of the solution scheme. The next step is the design of numerical criteria to detect, where possible, the faults that prevent convergence, and finally the design of a recovery scheme. In the context of the well-known Conjugate Gradient method, we illustrate these three steps and also present preliminary results for GMRES.
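The interpolation-restart idea described in the abstract can be sketched as follows. The abstract does not spell out the two interpolation policies, so this sketch assumes one natural choice for a symmetric positive definite system: the lost entries are rebuilt by solving the local system A[lost, lost] x_lost = b_lost - A[lost, alive] x_alive. The function names, the 1D Laplacian test matrix, the fault iteration and the set of lost indices are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cg(A, b, x0, tol=1e-10, max_iter=1000):
    """Plain Conjugate Gradient for SPD A; returns (iterate, iterations done)."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            return x, k
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

def interpolate_lost(A, b, x, lost):
    """Assumed policy: rebuild x[lost] from the surviving entries by solving
    A[lost, lost] x_lost = b[lost] - A[lost, alive] x[alive]."""
    alive = np.setdiff1d(np.arange(len(b)), lost)
    x_new = x.copy()
    rhs = b[lost] - A[np.ix_(lost, alive)] @ x[alive]
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

# Illustrative SPD problem: 1D Laplacian (hypothetical test case).
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x_exact = np.linalg.solve(A, b)

def a_norm_error(x):
    """Error measured in the A-norm, the quantity CG minimizes."""
    e = x - x_exact
    return np.sqrt(e @ A @ e)

# Run 20 iterations, then simulate a hard fault that wipes entries 40..59.
x_fault, _ = cg(A, b, np.zeros(n), max_iter=20)
lost = np.arange(40, 60)
x_damaged = x_fault.copy()
x_damaged[lost] = 0.0

# Interpolate the lost entries and restart CG from the repaired iterate.
x_restart = interpolate_lost(A, b, x_damaged, lost)
x_final, iters = cg(A, b, x_restart)
```

Under this assumed policy the repaired entries do not depend on the lost values at all, and for SPD matrices the local solve is an A-orthogonal projection, so the A-norm of the error of `x_restart` is no larger than that of the pre-fault iterate `x_fault`. This is one example of what preserving a key numerical property of CG across a restart could mean.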
Contributor : Luc Giraud
Submitted on : Tuesday, June 21, 2016 - 10:50:04 AM
Last modification on : Thursday, September 1, 2022 - 9:04:06 AM


  • HAL Id : hal-01334675, version 1



Emmanuel Agullo, Siegfried Cools, Luc Giraud, Alexandre Moreau, Pablo Salas, et al. Hard faults and soft errors: possible numerical remedies in linear algebra solvers. VecPar - International meeting on High Performance Computing for Computational science, Jun 2016, Porto, Portugal. ⟨hal-01334675⟩


