Hard faults and soft errors: possible numerical remedies in linear algebra solvers - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2016

Hard faults and soft errors: possible numerical remedies in linear algebra solvers

Abstract

The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hard faults and soft errors. Fully handling these faults at the computer system level may have a prohibitive cost. High-performance computing applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of faults. We focus on numerical linear algebra problems, such as the solution of linear systems or eigenproblems, which are the innermost numerical kernels in many scientific and engineering applications and also among their most time-consuming parts. To address hard faults on computing cores, we first present possible remedies based on recovery techniques followed by restarting strategies. In the framework of Krylov subspace linear solvers, the lost entries of the iterate are interpolated from the entries still available on the surviving cores to define a new initial guess before the Krylov method is restarted. In particular, we consider two interpolation policies that preserve key numerical properties of well-known linear solvers. Tackling silent data corruption (SDC) induced by soft errors is somewhat more complex, as we first need to better understand the impact of SDC on the numerical behavior of the solution scheme. The next step is the design of numerical criteria to detect, where possible, the faults that prevent convergence, and eventually the design of a recovery scheme. In the context of the well-known Conjugate Gradient method, we illustrate these three steps and also give preliminary results for GMRES.
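The interpolate-then-restart idea for hard faults can be illustrated with a small sketch. The abstract does not name its two interpolation policies, so the code below assumes one plausible choice: with the lost index set I and the surviving set J, the lost entries are rebuilt by solving A[I,I] x_I = b_I - A[I,J] x_J. For a symmetric positive definite A this is a block minimization of the energy functional, so it cannot increase the A-norm of the error, which is the kind of property a Conjugate Gradient restart would want to keep. The test matrix (a 1D Laplacian), the fault location, and all function names are hypothetical and chosen only for illustration.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, x0, tol=1e-10, max_iter=200, stop_after=None):
    """Plain conjugate gradient for SPD A; returns (x, iterations done)."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    p = list(r)
    rs = dot(r, r)
    bnorm = dot(b, b) ** 0.5
    for k in range(max_iter):
        # Stop on convergence, or early to mimic the moment a fault occurs.
        if rs ** 0.5 <= tol * bnorm or k == stop_after:
            return x, k
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(row) + [v] for row, v in zip(M, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def interpolate_lost(A, b, x, lost):
    """Assumed policy: solve A[lost,lost] x_lost = b[lost] - A[lost,alive] x[alive]."""
    lost_set = set(lost)
    alive = [j for j in range(len(b)) if j not in lost_set]
    M = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in alive) for i in lost]
    x_new = list(x)
    for i, v in zip(lost, solve_dense(M, rhs)):
        x_new[i] = v
    return x_new

def a_norm_error(A, x, x_ref):
    e = [xi - ri for xi, ri in zip(x, x_ref)]
    return dot(e, matvec(A, e)) ** 0.5

# Hypothetical test problem: 1D Laplacian, right-hand side of ones.
n = 12
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x_exact = solve_dense(A, b)

x_before, _ = cg(A, b, [0.0] * n, stop_after=3)   # iterate when the fault hits
lost = [4, 5, 6, 7]                               # entries owned by the failed core
x_faulty = [0.0 if i in lost else v for i, v in enumerate(x_before)]
x_restart = interpolate_lost(A, b, x_faulty, lost)  # new initial guess
x_final, iters = cg(A, b, x_restart)                # restarted Krylov solve

print("A-norm error before fault:", a_norm_error(A, x_before, x_exact))
print("A-norm error after interpolation:", a_norm_error(A, x_restart, x_exact))
print("restart iterations:", iters)
```

Note that the values written into the lost positions of `x_faulty` are never read by `interpolate_lost`; only the surviving entries, the matrix, and the right-hand side are used, which assumes that the static data A and b can be reloaded after the fault.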
No file deposited

Dates and versions

hal-01334675, version 1 (21-06-2016)

Identifiers

  • HAL Id: hal-01334675, version 1

Cite

Emmanuel Agullo, Siegfried Cools, Luc Giraud, Alexandre Moreau, Pablo Salas, et al. Hard faults and soft errors: possible numerical remedies in linear algebra solvers. VecPar - International meeting on High Performance Computing for Computational science, Jun 2016, Porto, Portugal. ⟨hal-01334675⟩
