SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing - Inria (French National Institute for Research in Digital Science and Technology)
Conference paper, Year: 2013

SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Abstract

The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
Main file
sc2013.pdf (433.12 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01121951, version 1 (02-03-2015)

Cite

Thomas Ropars, Tatiana Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello. SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13), 2013, Denver, United States. ⟨10.1145/2503210.2503271⟩. ⟨hal-01121951⟩