
Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

Camille Coti 1, 2 Thomas Herault 1, 2 Pierre Lemarinier 1, 2 Laurence Pilard 1, 2 Eric Rodriguez 1, 2 
2 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches have been proposed, using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.
Document type :
Conference papers
Contributor : Ist Rennes
Submitted on : Tuesday, April 3, 2012 - 2:18:52 PM
Last modification on : Sunday, June 26, 2022 - 11:55:57 AM




Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Eric Rodriguez. Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC2006), Nov 2006, Tampa, United States. ⟨10.1109/SC.2006.15⟩. ⟨hal-00684891⟩


