Replication for send-deterministic MPI HPC applications

Abstract: Replication has recently gained attention in the context of fault tolerance for large-scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent of the underlying library. In this paper, we evaluate the advantages of adopting a different approach. First, we try to take advantage of a communication property common to many MPI HPC applications, namely send-determinism. Second, we choose to implement replication inside the MPI library. The main advantage of our approach is simplicity. While being only a small patch to the Open MPI library, our solution, called SDR-MPI, supports most of the main features of the MPI standard, including all collectives and group operations. SDR-MPI additionally achieves good performance: experiments run with HPC benchmarks and applications show that its overhead remains below 5%.
Document type:
Conference paper
3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), 2013, New York City, United States. 〈10.1145/2465813.2465819〉
https://hal.inria.fr/hal-01121949
Contributor: Thomas Ropars
Submitted on: Monday, March 2, 2015 - 22:07:26
Last modified on: Tuesday, April 24, 2018 - 13:52:20
Document(s) archived on: Tuesday, June 2, 2015 - 09:56:12

File

ftxs06-lefray.pdf
Files produced by the author(s)

Citation

Arnaud Lefray, Thomas Ropars, André Schiper. Replication for send-deterministic MPI HPC applications. 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), 2013, New York City, United States. 〈10.1145/2465813.2465819〉. 〈hal-01121949〉

Metrics

Record views

114

File downloads

110