Conference Papers

On the resilience of parallel sparse hybrid solvers

Emmanuel Agullo, Luc Giraud, Mawussi Zounon

Abstract

As the computational power of high performance computing (HPC) systems continues to increase through the use of huge numbers of CPU cores or specialized processing units, extreme-scale applications are increasingly prone to faults. Consequently, the HPC community has proposed many contributions to design resilient HPC applications. These contributions may be system-oriented, theoretical or numerical. In this study we consider an actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MAPHYS, and we propose numerical remedies to design a resilient version of the solver. The solver being hybrid, we focus in this study on the iterative solution step, which is often the dominant step in practice. We furthermore assume that a separate mechanism ensures fault detection and that a system layer provides support for restoring the environment (processes, ...) to a running state. The present manuscript therefore focuses on (and only on) strategies for recovering lost data after the fault has been detected (a separate concern beyond the scope of this study) and once the system has been restored (another separate concern not studied here). The numerical remedies we propose are twofold. Whenever possible, we exploit the solver's natural data redundancy between processes to perform exact recovery through carefully chosen copies across processes. Otherwise, data that has been lost and is no longer available on any process is recovered through a so-called interpolation-restart mechanism. This mechanism is derived from [1] by carefully taking into account the properties of the target hybrid solver. These numerical remedies have been implemented in the MAPHYS parallel solver so that we can assess their efficiency on a large number of processing units (up to 12,288 CPU cores) for solving large-scale real-life problems.
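To make the interpolation-restart idea in the abstract concrete, the following is a minimal single-process sketch, not MAPHYS code. It assumes the simplest linear-interpolation form of recovery: for a system Ax = b, the lost entries x_I are rebuilt from the surviving entries x_J by solving A_II x_I = b_I - A_IJ x_J, and the iterative solver is then restarted from the repaired iterate. The choice of conjugate gradient, the 1D Laplacian test matrix, and all function names (`cg`, `interpolate_lost`, `solve_dense`) are illustrative assumptions; the sketch ignores the hybrid direct/iterative structure, preconditioning, and the parallel data distribution that the paper actually addresses.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cg(A, b, x0, tol=1e-10, max_iter=200):
    """Plain conjugate gradient for a symmetric positive definite A."""
    x = x0[:]
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    p = r[:]
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(M, rhs)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def interpolate_lost(A, b, x, lost):
    """Rebuild x[lost] from surviving entries: A_II x_I = b_I - A_IJ x_J."""
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_ll = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = x[:]
    for i, v in zip(lost, solve_dense(A_ll, rhs)):
        x_new[i] = v
    return x_new

# Hypothetical test problem: 1D Laplacian whose exact solution is all ones.
n = 12
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = matvec(A, [1.0] * n)

x_before = cg(A, b, [0.0] * n, tol=0.0, max_iter=3)   # a few iterations, then a fault
lost = [4, 5, 6, 7]                                   # entries owned by the failed process
x_faulty = [0.0 if i in lost else v for i, v in enumerate(x_before)]
x_recovered = interpolate_lost(A, b, x_faulty, lost)  # interpolation step
x_final = cg(A, b, x_recovered)                       # restart from the repaired iterate
```

In this sketch the static data (A and b) is assumed to be recoverable, so only the dynamic iterate is interpolated. For a symmetric positive definite matrix, the interpolated entries minimize the A-norm of the error over all vectors that agree with the surviving entries, which is why restarting from `x_recovered` is never worse, in that norm, than the iterate held just before the fault.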
Main file: Paper_HiPC.pdf (441.75 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01256316, version 1 (14-01-2016)
hal-01256316, version 2 (22-02-2018)

Identifiers

  • HAL Id: hal-01256316, version 2

Cite

Emmanuel Agullo, Luc Giraud, Mawussi Zounon. On the resilience of parallel sparse hybrid solvers. HiPC 2015 - IEEE International Conference on High Performance Computing, Dec 2015, Bangalore, India. ⟨hal-01256316v2⟩