Fault management in P2P-MPI
Abstract
This paper presents recent developments in fault management in P2P-MPI, a grid middleware. Fault management here covers both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors for a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol that is, in all cases, more reliable than the application execution. Experiments on applications of up to 256 processes, carried out on Grid'5000, show that the actual detection times closely match the predictions.
Origin: Files produced by the author(s)