Numerical recovery strategies for parallel resilient Krylov linear solvers

Emmanuel Agullo (1), Luc Giraud (1), Abdou Guermouche (1, 2), Jean Roman (1), Mawussi Zounon (1)
(1) HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract: As the computational power of high performance computing (HPC) systems continues to increase through the use of a huge number of cores or specialized processing units, HPC applications are increasingly prone to faults. In this paper, we present a new class of numerical fault tolerance algorithms to cope with node crashes in parallel distributed environments. This new resilient scheme is designed at the application level and does not require extra resources, i.e., computational units or computing time, when no fault occurs. In the framework of iterative methods for the solution of sparse linear systems, we present numerical algorithms to extract relevant information from available data after a fault, assuming a separate mechanism ensures the fault detection. After data extraction, a well-chosen part of the missing data is regenerated through interpolation strategies to constitute meaningful inputs to restart the iterative scheme. We have developed these methods, referred to as Interpolation-Restart techniques, for Krylov subspace linear solvers. After a fault, lost entries of the current iterate computed by the solver are interpolated to define a new initial guess to restart the Krylov method. A well-suited initial guess is computed by using the entries of the faulty iterate available on surviving nodes. We present two interpolation policies that preserve key numerical properties of well-known linear solvers, namely the monotonic decrease of the A-norm of the error of the conjugate gradient or the residual norm decrease of GMRES. The qualitative numerical behavior of the resulting scheme has been validated with sequential simulations, when the number of faults and the amount of data losses are varied. Finally, the computational costs associated with the recovery mechanism have been evaluated through parallel experiments.
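The interpolation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small dense symmetric positive definite matrix held in NumPy, a known set of lost indices, and one plausible interpolation rule, namely solving the local system restricted to the lost rows and columns while the surviving entries are held fixed. The names `interpolate_lost_entries` and `a_norm_error` and their parameters are hypothetical.

```python
import numpy as np


def interpolate_lost_entries(A, b, x, lost):
    """Rebuild the `lost` entries of iterate x from the surviving ones.

    Assumed rule (illustrative): solve
        A[lost, lost] @ x_lost = b[lost] - A[lost, kept] @ x[kept]
    and keep every surviving entry unchanged.
    """
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    # Right-hand side built only from data still available after the fault.
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new


def a_norm_error(A, x, x_true):
    """A-norm of the error, sqrt(e^T A e), for an SPD matrix A."""
    e = x_true - x
    return float(np.sqrt(e @ A @ e))
```

Under these assumptions, the returned vector would serve as the new initial guess when the Krylov solver is restarted. For an SPD matrix, this particular rule minimizes the A-norm of the error over the lost entries with the surviving entries fixed, so the error in that norm cannot exceed that of the iterate before the fault.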
Document type:
Journal article
Numerical Linear Algebra with Applications, Wiley, 2016, 23 (5), pp.888--905. 〈10.1002/nla.2059〉
Cited literature [43 references]

https://hal.inria.fr/hal-01323192
Contributor: Luc Giraud
Submitted on: Monday, May 30, 2016 - 11:26:42
Last modified on: Monday, September 18, 2017 - 09:52:10
Document(s) archived on: Wednesday, August 31, 2016 - 10:20:13

File

final_nlaa.pdf
Files produced by the author(s)

Citation

Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman, Mawussi Zounon. Numerical recovery strategies for parallel resilient Krylov linear solvers. Numerical Linear Algebra with Applications, Wiley, 2016, 23 (5), pp.888--905. 〈10.1002/nla.2059〉. 〈hal-01323192〉
