Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

Darius Buntinas (1), Camille Coti (2, 3), Thomas Hérault (2, 3), Pierre Lemarinier (2), Laurence Pilard (2), Ala Rezmerita (2, 3), Eric Rodriguez (2), Franck Cappello (2, 3)
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed, using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalabilities remain unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalabilities. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.
Document type:
Journal article
Future Generation Computer Systems, Elsevier, 2008, 24 (1), pp.73-84. 〈10.1016/j.future.2007.02.002〉

https://hal.inria.fr/hal-00688644
Contributor: Ist Rennes
Submitted on: Wednesday, April 18, 2012 - 10:50:05
Last modified on: Thursday, April 5, 2018 - 12:30:12

Citation

Darius Buntinas, Camille Coti, Thomas Hérault, Pierre Lemarinier, Laurence Pilard, et al. Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI. Future Generation Computer Systems, Elsevier, 2008, 24 (1), pp.73-84. 〈10.1016/j.future.2007.02.002〉. 〈hal-00688644〉
