Towards resilient parallel linear Krylov solvers: recover-restart strategies

Abstract: The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. High Performance Computing (HPC) applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of faults. In this work, we investigate possible remedies in the framework of the solution of large sparse linear systems, which is often the innermost numerical kernel in many scientific and engineering applications and also one of the most time-consuming parts. More precisely, we present recover-restart strategies for Krylov subspace solvers, in which lost entries of the iterate are interpolated to define a new initial guess before restarting. In particular, we consider two interpolation policies that preserve key numerical properties of well-known solvers, namely the monotonic decrease of the A-norm of the error for the conjugate gradient (CG) method and the decrease of the residual norm for GMRES. We assess the impact of the recovery method, the fault rate and the number of processors on the robustness of the resulting linear solvers. We consider experiments with CG, GMRES and Bi-CGStab.
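The recover-restart idea can be illustrated with a minimal sketch. The abstract does not spell out the interpolation formulas, so the code below assumes one natural choice for a symmetric positive definite matrix: the lost entries x_I are rebuilt by solving the local system A_II x_I = b_I - A_IJ x_J, where J indexes the surviving entries. This choice minimizes the A-norm of the error over the lost entries, so the restart guess is no worse in that norm than the iterate before the fault. The function names, the dense NumPy matrices, and the synthetic fault are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def interpolate_lost_entries(A, b, x, lost):
    """Rebuild x[lost] from the surviving entries of x (hypothetical helper).

    Assumed policy: solve A[lost, lost] y = b[lost] - A[lost, kept] x[kept].
    The values currently stored in x[lost] are ignored.
    """
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new


def a_norm(A, v):
    """A-norm sqrt(v^T A v) of a vector, for symmetric positive definite A."""
    return float(np.sqrt(v @ A @ v))


# Synthetic symmetric positive definite test problem (illustrative only).
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

# Stand-in for a solver iterate just before the fault.
x_before = x_true + 0.1 * rng.standard_normal(n)

# Simulated fault: the entries owned by one "process" are lost.
lost = np.arange(10, 20)
kept = np.setdiff1d(np.arange(n), lost)
x_fault = x_before.copy()
x_fault[lost] = 0.0

# Interpolated vector that would serve as the new initial guess on restart.
x_restart = interpolate_lost_entries(A, b, x_fault, lost)

err_before = a_norm(A, x_before - x_true)
err_after = a_norm(A, x_restart - x_true)
```

In a real distributed solver the local solve would involve only the sparse rows owned by the failed process, and the Krylov method would then be restarted from `x_restart`; both steps are outside the scope of this sketch.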
Document type: Conference paper
Sparse days 2013, Jun 2013, Toulouse, France. 2013
Contributor: Mawussi Zounon
Submitted on: Thursday, January 23, 2014 - 18:44:38
Last modified on: Thursday, January 11, 2018 - 06:22:35


  • HAL Id : hal-00935685, version 1


Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman, Mawussi Zounon. Towards resilient parallel linear Krylov solvers: recover-restart strategies. Sparse days 2013, Jun 2013, Toulouse, France. 2013. 〈hal-00935685〉


