A. Agbaria and R. Friedman, Starfish: fault-tolerant dynamic MPI programs on clusters of workstations, Proceedings of the Eighth International Symposium on High Performance Distributed Computing, pp.167-176, 1999.
DOI : 10.1109/HPDC.1999.805295

R. Badrinath and C. Morin, Common mechanisms for supporting fault tolerance in DSM and message passing systems, 2003.
URL : https://hal.archives-ouvertes.fr/hal-01272454

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak et al., MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, ACM/IEEE SC 2002 Conference (SC'02), pp.29-47, 2002.
DOI : 10.1109/SC.2002.10048
URL : https://hal.archives-ouvertes.fr/in2p3-00457138

K. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

M. Costa, P. Guedes, M. Sequeira, N. Neves, and M. Castro, Lightweight Logging for Lazy Release Consistent Distributed Shared Memory, Operating Systems Design and Implementation, pp.59-73, 1996.

M. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

S. Monnet, Conception et évaluation d'un protocole de reprise d'applications parallèles dans une fédération de grappes de calculateurs, 2003.

C. Morin, A. Kermarrec, M. Banâtre, and A. Gefflaut, An efficient and scalable approach for implementing fault-tolerant DSM architectures, IEEE Transactions on Computers, vol.49, issue.5, pp.414-430, 2000.
DOI : 10.1109/12.859537
URL : https://hal.archives-ouvertes.fr/inria-00073588

A. Nguyen-Tuong, Integrating Fault-Tolerance Techniques in Grid Applications, 2000.

H. Paul, A. Gupta, and R. Badrinath, Hierarchical Coordinated Checkpointing Protocol, International Conference on Parallel and Distributed Computing Systems, pp.240-245, 2002.

J. Rough and A. Goscinski, Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS, Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002.
DOI : 10.1109/ICAPP.2002.1173584
