MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI

Aurélien Bouteiller (1), T. Herault, G. Krawezik, P. Lemarinier, Franck Cappello (1, 2)
(2) GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
Abstract: High-performance computing platforms such as clusters, Grids and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We then present four fault-tolerant protocols implemented in a new generic framework for fault-tolerant protocol comparison, covering a large spectrum of known approaches, from coordinated checkpoint to uncoordinated checkpoint associated with causal message logging. We measure the performance of these protocols on a micro-benchmark and compare them on the NAS benchmark, using an original fault tolerance test. Finally, we outline the lessons learned from this in-depth comparison of fault-tolerant protocols for MPI applications.
Document type: Journal article
International Journal of High Performance Computing Applications, SAGE Publications, 2006, 20 (3), pp.319-333. 〈10.1177/1094342006067469〉

https://hal.inria.fr/hal-00688637
Contributor: Ist Rennes
Submitted on: Wednesday, April 18, 2012 - 10:42:41
Last modified on: Thursday, April 5, 2018 - 12:30:12

Citation

Aurélien Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, Franck Cappello. MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI. International Journal of High Performance Computing Applications, SAGE Publications, 2006, 20 (3), pp.319-333. 〈10.1177/1094342006067469〉. 〈hal-00688637〉
