Hard faults and soft errors: possible numerical remedies in linear algebra solvers

Abstract: The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hard faults and soft errors. Fully handling these faults at the computer system level may be prohibitively expensive. High performance computing applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of faults. We focus on numerical linear algebra problems, such as the solution of linear systems or eigenproblems, which are the innermost numerical kernels in many scientific and engineering applications and also among their most time-consuming parts. To address hard faults on computing cores, we first present possible remedies based on recovery techniques followed by restarting strategies. In the framework of Krylov subspace linear solvers, the lost entries of the iterate are interpolated from the entries still available on the surviving cores to define a new initial guess before the Krylov method is restarted. In particular, we consider two interpolation policies that preserve key numerical properties of well-known linear solvers. Tackling silent data corruption (SDC) induced by soft errors is somewhat more complex, as we first need to better understand the impact of SDC on the numerical behavior of the solution scheme. The next step is the design of numerical criteria to detect, where possible, the faults that prevent convergence, and finally the design of a recovery scheme. We illustrate these three steps in the context of the well-known Conjugate Gradient method and present preliminary results for GMRES.
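
The interpolation-restart idea can be pictured with a small sketch. The abstract does not specify the two policies, so the code below assumes one plausible choice: the lost entries x_I are rebuilt by solving the local system A_II x_I = b_I - A_IJ x_J, where J indexes the surviving entries. All function names, the test matrix, and the iterate values are hypothetical and chosen only for illustration. For a symmetric positive definite matrix, this particular choice cannot increase the A-norm of the error, which is the kind of property a Conjugate Gradient restart would want to keep.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def interpolate_lost_entries(A, b, x, lost):
    """Rebuild x on the indices in 'lost' from the surviving entries.

    Assumed policy: solve A[lost, lost] x_lost = b[lost] - A[lost, alive] x_alive.
    The values currently stored in x at the lost indices are ignored.
    """
    alive = [j for j in range(len(x)) if j not in lost]
    local = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in alive) for i in lost]
    x_new = list(x)
    for i, value in zip(lost, solve_dense(local, rhs)):
        x_new[i] = value
    return x_new


def energy_error(A, x, x_exact):
    """Squared A-norm of the error, e^T A e."""
    e = [u - v for u, v in zip(x, x_exact)]
    n = len(e)
    return sum(e[i] * A[i][j] * e[j] for i in range(n) for j in range(n))


# Hypothetical test problem: 1D Laplacian (symmetric positive definite).
n = 6
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_exact = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [sum(A[i][j] * x_exact[j] for j in range(n)) for i in range(n)]

# Iterate held just before the fault (made-up values), then a core
# owning entries 2 and 3 fails and those entries are lost.
x_iterate = [1.1, 1.9, 3.2, 3.7, 5.1, 6.0]
lost = [2, 3]
x_faulty = [0.0 if i in lost else v for i, v in enumerate(x_iterate)]

# New initial guess that would be handed to the restarted Krylov solver.
x_restart = interpolate_lost_entries(A, b, x_faulty, lost)
print("restart guess:", x_restart)
print("error before fault:", energy_error(A, x_iterate, x_exact))
print("error after interpolation:", energy_error(A, x_restart, x_exact))
```

The local solve only involves the rows and columns owned by the failed core, so in a distributed setting it would stay small relative to the global problem; the dense solver here merely keeps the sketch self-contained.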
Document type:
Conference paper
VecPar - International meeting on High Performance Computing for Computational science, Jun 2016, Porto, Portugal. 〈http://vecpar.fe.up.pt/2016/index.html〉

https://hal.inria.fr/hal-01334675
Contributor: Luc Giraud
Submitted on: Tuesday, 21 June 2016 - 10:50:04
Last modified on: Thursday, 11 January 2018 - 01:51:38

Identifiers

  • HAL Id : hal-01334675, version 1

Citation

Emmanuel Agullo, Siegfried Cools, Luc Giraud, Alexandre Moreau, Pablo Salas, et al. Hard faults and soft errors: possible numerical remedies in linear algebra solvers. VecPar - International meeting on High Performance Computing for Computational science, Jun 2016, Porto, Portugal. 〈http://vecpar.fe.up.pt/2016/index.html〉. 〈hal-01334675〉
