
Replication for send-deterministic MPI HPC applications

Abstract: Replication has recently gained attention as a fault-tolerance technique for large-scale MPI HPC applications. Existing implementations try to cover all MPI codes and to remain independent of the underlying library. In this paper, we evaluate the advantages of adopting a different approach. First, we take advantage of a communication property common to many MPI HPC applications, namely send-determinism. Second, we implement replication inside the MPI library. The main advantage of our approach is simplicity. Although it is only a small patch to the Open MPI library, our solution, called SDR-MPI, supports most of the main features of the MPI standard, including all collectives and group operations. SDR-MPI also achieves good performance: experiments run with HPC benchmarks and applications show that its overhead remains below 5%.
Document type: Conference papers

Cited literature: 18 references
Contributor: Thomas Ropars
Submitted on: Monday, March 2, 2015 - 10:07:26 PM
Last modification on: Saturday, September 11, 2021 - 3:17:41 AM
Long-term archiving on: Tuesday, June 2, 2015 - 9:56:12 AM






Arnaud Lefray, Thomas Ropars, André Schiper. Replication for send-deterministic MPI HPC applications. 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), 2013, New York City, United States. ⟨10.1145/2465813.2465819⟩. ⟨hal-01121949⟩


