Fault management in P2P-MPI

Abstract: This paper presents a study of fault management in a grid middleware, our home-grown software P2P-MPI. The framework is MPJ compliant, lets users execute message-passing parallel programs, and aims to support environments built from commodity hardware. Program execution in such environments is failure prone, so particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. Fault tolerance deals with program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of program execution by the system, which is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is an evaluation of the failure probability of an application as a function of the replication degree. Because the failure probability depends on the execution length, we propose a model to evaluate the duration of a replicated parallel program. We then give an expression for the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. Our evaluation criteria are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol that is, in all cases, more reliable than the application execution itself. Experiments involving up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.
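As a rough illustration of how a replication degree can be chosen against a failure-probability threshold, the sketch below uses a deliberately simple model that is an assumption of this note, not the model derived in the paper: each host fails independently with an exponential law of rate `rate`, the application has `n` logical processes each replicated `r` times, and the execution fails as soon as every replica of any one logical process has failed. The function names and parameters are hypothetical. The execution duration is held fixed here, whereas the paper models how the duration itself varies with the replication degree.

```python
import math


def replica_group_failure(rate, duration, r):
    """Probability that all r replicas of one logical process fail
    within `duration`, assuming independent exponential failures
    (an assumption of this sketch, not the paper's model)."""
    p_single = 1.0 - math.exp(-rate * duration)
    return p_single ** r


def application_failure(rate, duration, n, r):
    """Probability that at least one of the n logical processes
    loses all of its r replicas before the execution ends."""
    q = replica_group_failure(rate, duration, r)
    return 1.0 - (1.0 - q) ** n


def min_replication_degree(rate, duration, n, threshold, max_r=64):
    """Smallest r whose failure probability is at or below `threshold`,
    or None if no r up to max_r is sufficient."""
    for r in range(1, max_r + 1):
        if application_failure(rate, duration, n, r) <= threshold:
            return r
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 64 processes, a one-hour run,
    # one failure per 10,000 seconds per host on average.
    r = min_replication_degree(rate=1e-4, duration=3600.0, n=64, threshold=0.01)
    print("replication degree needed:", r)
```

Under this simplified model the failure probability decreases monotonically as `r` grows, so a linear search is enough; a model in which replication lengthens the execution would need to recompute `duration` for each candidate `r`.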
Document type:
Journal article
International Journal of Parallel Programming, Springer Verlag, 2009, 37 (5), pp.433-461. 〈10.1007/s10766-009-0115-8〉

https://hal.inria.fr/inria-00425516
Contributor: Stéphane Genaud
Submitted on: Wednesday, 21 October 2009 - 21:04:08
Last modified on: Thursday, 11 January 2018 - 06:22:43

Citation

Stéphane Genaud, Emmanuel Jeannot, Choopan Rattanapoka. Fault management in P2P-MPI. International Journal of Parallel Programming, Springer Verlag, 2009, 37 (5), pp.433-461. 〈10.1007/s10766-009-0115-8〉. 〈inria-00425516〉
