Conference papers

SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Abstract : The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.

Cited literature : 29 references
Contributor : Thomas Ropars
Submitted on : Monday, March 2, 2015 - 10:20:27 PM
Last modification on : Friday, October 7, 2022 - 3:48:28 AM
Long-term archiving on : Tuesday, June 2, 2015 - 10:00:21 AM





Thomas Ropars, Tatiana Martsinkevitch, Amina Guermouche, André Schiper, Franck Cappello. SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13), 2013, Denver, United States. ⟨10.1145/2503210.2503271⟩. ⟨hal-01121951⟩


