SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Abstract: The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
Document type:
Conference paper
International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13), 2013, Denver, United States. 2013, 〈10.1145/2503210.2503271〉
Cited literature: 29 references

https://hal.inria.fr/hal-01121951
Contributor: Thomas Ropars
Submitted on: Monday, March 2, 2015 - 22:20:27
Last modified on: Tuesday, April 24, 2018 - 13:37:42
Document(s) archived on: Tuesday, June 2, 2015 - 10:00:21

File

sc2013.pdf
Files produced by the author(s)

Citation

Thomas Ropars, Tatiana Martsinkevitch, Amina Guermouche, André Schiper, Franck Cappello. SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13), 2013, Denver, United States. 2013, 〈10.1145/2503210.2503271〉. 〈hal-01121951〉

Metrics

Record views

455

File downloads

112