Towards resilient parallel linear Krylov solvers: recover-restart strategies

Abstract: The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. High Performance Computing (HPC) applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of faults. In this work, we investigate possible remedies in the framework of the solution of large sparse linear systems, which is often the innermost numerical kernel in many scientific and engineering applications and also one of the most time-consuming parts. More precisely, we present recover-restart strategies for Krylov subspace solvers, in which lost entries of the iterate are interpolated to define a new initial guess before restarting. In particular, we consider two interpolation policies that preserve key numerical properties of well-known solvers, namely the monotonic decrease of the A-norm of the error for the conjugate gradient (CG) method or of the residual norm for GMRES. We assess the impact of the recovery method, the fault rate and the number of processors on the robustness of the resulting linear solvers. We consider experiments with CG, GMRES and Bi-CGStab.
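To make the recover-restart idea concrete, here is a minimal sketch of one plausible interpolation step. It is an assumption made for illustration, not necessarily either of the two policies studied in the report. With I the set of lost indices and J the surviving ones, the lost entries are rebuilt by solving the local system A[I,I] x[I] = b[I] - A[I,J] x[J]. For a symmetric positive definite A this choice minimizes the A-norm of the error over x[I] with x[J] held fixed, so the A-norm of the error cannot grow. The names `interpolate_lost_entries` and `a_norm_error`, the dense NumPy storage and the small test matrix are all hypothetical.

```python
import numpy as np

def interpolate_lost_entries(A, b, x, lost):
    """Rebuild the entries of x indexed by `lost` from the surviving ones.

    Solves A[I, I] x[I] = b[I] - A[I, J] x[J], where I = lost and
    J = all other indices. Hypothetical helper, dense storage for brevity.
    """
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

def a_norm_error(A, x, x_star):
    """A-norm of the error x - x_star (A assumed symmetric positive definite)."""
    e = x - x_star
    return float(np.sqrt(e @ A @ e))

# Small SPD test problem: 1D Laplacian with a known solution.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x_star = np.arange(1.0, n + 1.0)
b = A @ x_star

# Pretend this is the current iterate of a solver such as CG.
rng = np.random.default_rng(0)
x_iter = x_star + 0.1 * rng.standard_normal(n)

# A fault wipes the entries owned by one (hypothetical) process.
lost = [2, 3]
x_faulty = x_iter.copy()
x_faulty[lost] = 0.0

# Recover, then use x_rec as the initial guess when restarting the solver.
x_rec = interpolate_lost_entries(A, b, x_faulty, lost)
print(a_norm_error(A, x_iter, x_star), a_norm_error(A, x_rec, x_star))
```

Only the block A[I,I] and the rows A[I,:] are needed, so in a distributed setting the recovery would stay local to the data of the failed process and its neighbours; that reading is an inference from the abstract rather than a statement of the report's implementation.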
Document type:
Report
[Research Report] RR-8324, INRIA. 2013, pp.36

Cited literature [26 references]

https://hal.inria.fr/hal-00843992
Contributor: Luc Giraud
Submitted on: Friday, 12 July 2013 - 15:26:27
Last modified on: Monday, 18 September 2017 - 09:52:07
Document(s) archived on: Wednesday, 5 April 2017 - 10:39:05

File

RR-8324.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00843992, version 1

Citation

Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman, Mawussi Zounon. Towards resilient parallel linear Krylov solvers: recover-restart strategies. [Research Report] RR-8324, INRIA. 2013, pp.36. 〈hal-00843992〉
