Conference papers

Fault management in P2P-MPI

Abstract : This paper presents recent developments in fault management in P2P-MPI, a grid middleware, covering both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors for a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol that is, in all cases, more reliable than the application execution itself. Experiments on applications of up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.

Cited literature [10 references]
Contributor : Stéphane Genaud
Submitted on : Wednesday, October 27, 2010 - 10:31:42 AM
Last modification on : Monday, May 4, 2020 - 11:38:49 AM
Long-term archiving on : Friday, October 26, 2012 - 12:25:56 PM


Files produced by the author(s)


  • HAL Id : inria-00529974, version 1



Stéphane Genaud, Choopan Rattanapoka. Fault management in P2P-MPI. In proceedings of International Conference on Grid and Pervasive Computing, GPC'07, May 2007, Paris, France. ⟨inria-00529974⟩


