Conference papers

Recover-Restart Strategies for Resilient Parallel Numerical Linear Algebra Solvers

Abstract : The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. Fully handling these faults at the computer system level may have a prohibitive cost. High performance computing applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of core crashes. In this work, we investigate possible remedies in the framework of numerical linear algebra problems, such as the solution of linear systems or eigenproblems, which are the innermost numerical kernels in many scientific and engineering applications and also among the most time-consuming parts. More precisely, we present recovery techniques followed by restarting strategies. In the framework of Krylov subspace linear solvers, the lost entries of the iterate are interpolated using the available entries on the surviving cores to define a new initial guess before restarting the Krylov method. In particular, we consider two interpolation policies that preserve key numerical properties of well-known linear solvers, namely the monotonic decrease of the A-norm of the error for the conjugate gradient method or of the residual norm for GMRES. We extend these interpolation ideas to some state-of-the-art eigensolvers, where these recovery approaches are applied to reconstruct a meaningful search space for restarting. We assess the impact of the recovery method, the fault rate, and the number of processors on the robustness of the resulting numerical linear solvers.
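The recover step described in the abstract can be illustrated with a small sketch. It assumes one plausible form of interpolation: the lost entries are recomputed by solving the local system A[lost, lost] x_lost = b[lost] - A[lost, alive] x_alive, using only surviving data. The abstract does not state the formulas of the two policies studied, so this is not necessarily either of them; the names `solve_dense` and `interpolate_lost_entries` and the dense list-of-lists storage are hypothetical choices made for illustration.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def interpolate_lost_entries(A, b, x, lost):
    """Return a copy of x whose `lost` entries are recomputed from surviving ones.

    Assumed interpolation (illustrative only):
        A[lost, lost] x_lost = b[lost] - A[lost, alive] x_alive
    """
    lost = list(lost)
    lost_set = set(lost)
    alive = [j for j in range(len(x)) if j not in lost_set]
    A_ll = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in alive) for i in lost]
    x_lost = solve_dense(A_ll, rhs)
    x_new = list(x)
    for i, value in zip(lost, x_lost):
        x_new[i] = value
    return x_new


# Toy problem: 1D Laplacian (symmetric positive definite), n = 6.
n = 6
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]

# Simulate a fault: entries 2 and 3 of the current iterate are lost.
x_faulty = list(x_true)
x_faulty[2] = x_faulty[3] = 0.0

# The recovered vector would serve as the new initial guess for the restart.
recovered = interpolate_lost_entries(A, b, x_faulty, lost=[2, 3])
```

In this toy case the surviving entries are already exact, so the interpolation reproduces the lost values; in an actual run the surviving entries are only approximate and the result is merely a reasonable initial guess for restarting the Krylov method.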
Contributor : Mawussi Zounon
Submitted on : Tuesday, August 26, 2014 - 10:55:14 AM
Last modification on : Saturday, June 25, 2022 - 7:42:24 PM


  • HAL Id : hal-01058138, version 1



Emmanuel Agullo, Luc Giraud, Pablo Salas, Mawussi Zounon. Recover-Restart Strategies for Resilient Parallel Numerical Linear Algebra Solvers. International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2014), Jul 2014, Lugano, Switzerland. ⟨hal-01058138⟩


