Fault management in P2P-MPI

Abstract: This paper presents recent developments in fault management for P2P-MPI, a grid middleware, covering both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors for a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol, which is more reliable than the application execution in any case. Experiments on applications of up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.
Document type:
Conference paper
C. Cérin and K.-C. Li. In proceedings of International Conference on Grid and Pervasive Computing, GPC'07, May 2007, Paris, France. Springer, 4459, 2007

https://hal.inria.fr/inria-00529974
Contributor: Stéphane Genaud
Submitted on: Wednesday, October 27, 2010 - 10:31:42
Last modified on: Saturday, January 13, 2018 - 01:03:15
Document(s) archived on: Friday, October 26, 2012 - 12:25:56

File

icps-2007-185.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00529974, version 1

Citation

Stéphane Genaud, Choopan Rattanapoka. Fault management in P2P-MPI. C. Cérin and K.-C. Li. In proceedings of International Conference on Grid and Pervasive Computing, GPC'07, May 2007, Paris, France. Springer, 4459, 2007. 〈inria-00529974〉

Metrics

Record views

140

File downloads

155