On the resilience of a parallel sparse hybrid solver

Emmanuel Agullo 1, Luc Giraud 1, Mawussi Zounon 1
1 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract: As the computational power of high performance computing (HPC) systems continues to increase through the use of a huge number of CPU cores or specialized processing units, extreme-scale applications are increasingly prone to faults. As a consequence, the HPC community has proposed many contributions for designing resilient HPC applications, whether system-oriented, theoretical or numerical. In this study we consider an actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MaPHyS, and we propose numerical remedies to design a resilient version of the solver. The solver being hybrid, we focus in this study on the iterative solution step, which is often the dominant step in practice. We furthermore assume that a separate mechanism ensures fault detection and that a system layer provides support for setting the environment (processes, ...) back in a running state. The present manuscript therefore focuses on (and only on) strategies for recovering lost data after the fault has been detected (a separate concern outside the scope of this study) and once the system is back in a running state (another separate concern not studied here either). The numerical remedies we propose are twofold. Whenever possible, we exploit the natural data redundancy between processes in the solver to perform exact recovery through clever copies across processes. Otherwise, data that has been lost and is no longer available on any process is recovered through a so-called interpolation-restart mechanism. This mechanism is derived from [aggr:13] so as to carefully take into account the properties of the target hybrid solver. These numerical remedies have been implemented in the MaPHyS parallel solver so that we can assess their efficiency on a large number of processing units (up to 12,288 CPU cores) for solving large-scale real-life problems.
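The abstract does not detail the interpolation-restart mechanism. As a rough illustration only, the sketch below assumes one simple interpolation policy: the lost entries of the current iterate are rebuilt by solving the diagonal block of the linear system restricted to those entries, using the surviving entries as known data, after which the iterative solver would be restarted from the repaired iterate. The function name `interpolate_lost_block`, the dense NumPy setting, and the choice of policy are assumptions made for this example; this is not the MaPHyS API, and the report's variant is tailored to the hybrid solver in ways not shown here.

```python
import numpy as np


def interpolate_lost_block(A, b, x, lost):
    """Rebuild the entries x[lost] of an iterate from its surviving entries.

    Hypothetical helper (not part of MaPHyS). It solves the local system
        A[lost, lost] @ x_lost = b[lost] - A[lost, kept] @ x[kept]
    and assumes the diagonal block A[lost, lost] is nonsingular, which
    holds for instance when A is symmetric positive definite.
    """
    lost = np.asarray(lost)
    # Indices whose data survived the fault.
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    # Right-hand side of the local system, built only from surviving data.
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    # An iterative solver would then be restarted with x_new as initial guess.
    return x_new
```

By construction, the residual `b - A @ x_new` vanishes on the rows indexed by `lost`, so if the surviving entries were already exact the lost entries are recovered exactly. In general the repaired iterate is only an approximation, which is why a restart of the iterative solver follows.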
Document type:
Report
[Research Report] RR-8744, INRIA Bordeaux; INRIA. 2015

Cited literature: 24 references

https://hal.inria.fr/hal-01165186
Contributor: Mawussi Zounon
Submitted on: Monday, June 22, 2015 - 10:38:51
Last modified on: Saturday, September 17, 2016 - 01:36:51
Document(s) archived on: Tuesday, April 25, 2017 - 18:06:18

File

RR-8744.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01165186, version 2

Citation

Emmanuel Agullo, Luc Giraud, Mawussi Zounon. On the resilience of a parallel sparse hybrid solver. [Research Report] RR-8744, INRIA Bordeaux; INRIA. 2015. 〈hal-01165186v2〉

Metrics

Record views: 351

Document downloads: 126