SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Abstract: The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, hierarchically combines coordinated checkpointing and message logging. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.

Cited literature: 29 references

https://hal.inria.fr/hal-01121951
Contributor: Thomas Ropars
Submitted on: Monday, March 2, 2015 - 10:20:27 PM
Last modification on: Thursday, August 1, 2019 - 2:12:06 PM
Long-term archiving on: Tuesday, June 2, 2015 - 10:00:21 AM

File

sc2013.pdf
Files produced by the author(s)

Citation

Thomas Ropars, Tatiana Martsinkevitch, Amina Guermouche, André Schiper, Franck Cappello. SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13), 2013, Denver, United States. ⟨10.1145/2503210.2503271⟩. ⟨hal-01121951⟩
