Fault management in P2P-MPI

Abstract : We present recent developments in fault management in P2P-MPI, a grid middleware, covering both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors for a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol, which is more reliable than the application execution in any case. Experiments on applications of up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.

https://hal.inria.fr/inria-00529974
Contributor : Stéphane Genaud
Submitted on : Wednesday, October 27, 2010 - 10:31:42 AM
Last modification on : Saturday, January 13, 2018 - 1:03:15 AM
Long-term archiving on : Friday, October 26, 2012 - 12:25:56 PM

File

icps-2007-185.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00529974, version 1

Citation

Stéphane Genaud, Choopan Rattanapoka. Fault management in P2P-MPI. In proceedings of International Conference on Grid and Pervasive Computing, GPC'07, May 2007, Paris, France. ⟨inria-00529974⟩
