
Replication for send-deterministic MPI HPC applications

Abstract: Replication has recently gained attention as a fault-tolerance technique for large-scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent of the underlying library. In this paper, we evaluate the advantages of adopting a different approach. First, we take advantage of a communication property common to many MPI HPC applications, namely send-determinism. Second, we implement replication inside the MPI library. The main advantage of our approach is simplicity. While being only a small patch to the Open MPI library, our solution, called SDR-MPI, supports most of the main features of the MPI standard, including all collectives and group operations. SDR-MPI additionally achieves good performance: experiments run with HPC benchmarks and applications show that its overhead remains below 5%.
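The abstract names send-determinism without defining it. Under the usual definition (a process sends the same sequence of messages in every correct execution, whatever the order in which it receives messages), the property can be illustrated with a toy simulation. The sketch below is not SDR-MPI code and uses no MPI calls; the function names, rank labels, and message format are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def reduce_then_send(inbox):
    """Toy send-deterministic process: the outgoing message depends only on
    the set of received payloads, not on their arrival order."""
    return [("rank0", sum(payload for _, payload in inbox))]


def forward_first(inbox):
    """Toy non-send-deterministic process: the outgoing message depends on
    which message happened to arrive first."""
    _, payload = inbox[0]
    return [("rank0", payload)]


def is_send_deterministic(process, messages):
    """Return True if every arrival order yields the same sequence of sends."""
    sent = {tuple(process(list(order))) for order in permutations(messages)}
    return len(sent) == 1


# Hypothetical incoming messages as (sender, payload) pairs.
messages = [("rank1", 3), ("rank2", 4), ("rank3", 5)]

print(is_send_deterministic(reduce_then_send, messages))
print(is_send_deterministic(forward_first, messages))
```

In this model, two replicas of `reduce_then_send` that see their inputs in different orders still emit identical messages, which is presumably the kind of behavior a replication scheme can exploit to avoid forcing replicas to agree on reception order.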
Document type: Conference paper

Cited literature: 18 references

https://hal.inria.fr/hal-01121949
Contributor: Thomas Ropars
Submitted on: Monday, March 2, 2015 - 10:07:26 PM
Last modification on: Monday, May 4, 2020 - 11:37:52 AM
Archived on: Tuesday, June 2, 2015 - 9:56:12 AM

File

ftxs06-lefray.pdf
Files produced by the author(s)


Citation

Arnaud Lefray, Thomas Ropars, André Schiper. Replication for send-deterministic MPI HPC applications. 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), 2013, New York City, United States. ⟨10.1145/2465813.2465819⟩. ⟨hal-01121949⟩

