Recover-Restart Strategies for Resilient Parallel Numerical Linear Algebra Solvers

Abstract: The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. Fully handling these faults at the computer system level may have a prohibitive cost. High performance computing applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of core crashes. In this work, we investigate possible remedies in the framework of numerical linear algebra problems, such as the solution of linear systems or eigenproblems, which are the innermost numerical kernels in many scientific and engineering applications and also among the most time-consuming parts. More precisely, we present recovery techniques followed by restarting strategies. In the framework of Krylov subspace linear solvers, the lost entries of the iterate are interpolated using the available entries on the cores that are still alive to define a new initial guess before restarting the Krylov method. In particular, we consider two interpolation policies that preserve key numerical properties of well-known linear solvers, namely the monotonic decrease of the A-norm of the error for the conjugate gradient method or the decrease of the residual norm for GMRES. We extend these interpolation ideas to some state-of-the-art eigensolvers, where these recovery approaches are applied to reconstruct a meaningful search space for restarting. We assess the impact of the recovery method, the fault rate and the number of processors on the robustness of the resulting numerical linear solvers.
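The abstract does not give formulas for the two interpolation policies, so the following is only a minimal sketch under stated assumptions. It assumes that the first policy solves the local block of the system for the lost entries (which, for a symmetric positive definite A, minimizes the A-norm of the error over those entries) and that the second solves a least-squares problem on the residual. The function names `recover_linear` and `recover_least_squares`, and the use of dense NumPy arrays on a single process, are illustrative choices and not the authors' implementation.

```python
import numpy as np

def recover_linear(A, b, x, lost):
    """Rebuild x[lost] by solving A[lost, lost] x_lost = b[lost] - A[lost, kept] x[kept].

    Assumption: for SPD A this choice minimizes the A-norm of the error
    over the lost entries, with the surviving entries held fixed.
    """
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

def recover_least_squares(A, b, x, lost):
    """Rebuild x[lost] by minimizing ||b - A x||_2 over the lost entries only.

    Assumption: the surviving entries x[kept] are held fixed, so the
    residual norm cannot exceed that of the iterate before the fault.
    """
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, kept] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.lstsq(A[:, lost], rhs, rcond=None)[0]
    return x_new
```

Either recovered vector would then serve as the new initial guess when the Krylov method is restarted. In a distributed setting the lost index set would correspond to the rows owned by the failed core, and only the data held by surviving cores would be available to form the right-hand side.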
Document type:
Conference paper
International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2014), Jul 2014, Lugano, Switzerland. 2014

https://hal.inria.fr/hal-01058138
Contributor: Mawussi Zounon
Submitted on: Tuesday, 26 August 2014 - 10:55:14
Last modified on: Thursday, 11 January 2018 - 06:22:35

Identifiers

  • HAL Id: hal-01058138, version 1

Citation

Emmanuel Agullo, Luc Giraud, Pablo Salas, Mawussi Zounon. Recover-Restart Strategies for Resilient Parallel Numerical Linear Algebra Solvers. International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2014), Jul 2014, Lugano, Switzerland. 2014. 〈hal-01058138〉
