M. Litzkow and M. Solomon, The evolution of Condor checkpointing, pp.163-164, 1999.

J. S. Plank, M. Beck, G. Kingsley, and K. Li, Libckpt: Transparent checkpointing under Unix, 1994.

A. Mirkin, A. Kuznetsov, and K. Kolyshkin, Containers checkpointing and live migration, Linux Symposium, p.101, 2008.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak et al., MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, ACM/IEEE SC 2002 Conference (SC'02), pp.1-18, 2002.
DOI : 10.1109/SC.2002.10048
URL : https://hal.archives-ouvertes.fr/in2p3-00457138

J. Ansel, K. Arya, and G. Cooperman, DMTCP: Transparent checkpointing for cluster computations and the desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009.
DOI : 10.1109/IPDPS.2009.5161063

T. Cortes, C. Franke, Y. Jégou, T. Kielmann, D. Laforenza et al., XtreemOS: a Vision for a Grid Operating System, 2008.

J. Mehnert-Spahn, T. Ropars, M. Schoettner, and C. Morin, The Architecture of the XtreemOS Grid Checkpointing Service, Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp.429-441, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00424009

J. Mehnert-Spahn and M. Schoettner, Checkpointing and Migration of Communication Channels in Heterogeneous Grid Environments, The 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), 2010.
DOI : 10.1007/978-3-642-13119-6_23

H. P. Reiser, R. Kapitza, J. Domaschka, and F. J. Hauck, Fault-Tolerant Replication Based on Fragmented Objects, Proceedings of the 6th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems, pp.14-16, 2006.
DOI : 10.1007/11773887_20

R. Strom and S. Yemini, Optimistic recovery in distributed systems, ACM Transactions on Computer Systems, vol.3, issue.3, pp.204-226, 1985.
DOI : 10.1145/3959.3962

R. D. Schlichting and F. B. Schneider, Fail-stop processors: an approach to designing fault-tolerant computing systems, ACM Transactions on Computer Systems, vol.1, issue.3, 1983.
DOI : 10.1145/357369.357371

F. Hupfeld, T. Cortes, B. Kolbeck, J. Stender, E. Focht et al., The XtreemFS architecture-a case for object-based file systems in Grids, Concurrency and Computation: Practice and Experience, vol.20, issue.17, pp.2049-2060, 2008.
DOI : 10.1002/cpe.1304

Y. Wang and W. K. Fuchs, Lazy checkpoint coordination for bounding rollback propagation, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems, pp.78-85, 1993.
DOI : 10.1109/RELDIS.1993.393471

B. A. Kuperman and E. Spafford, Generation of application level audit data via library interposition, Tech. Rep, 1999.

D. Margery, G. Vallee, R. Lottiaux, C. Morin, and J. Berthou, Kerrighed: A SSI cluster OS running OpenMP, Proc. 5th European Workshop on OpenMP, 2003.
URL : https://hal.archives-ouvertes.fr/hal-01272452

T. Ropars and C. Morin, Fault Tolerance in Cluster Federations with O2P-CF, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp.807-812, 2008.
DOI : 10.1109/CCGRID.2008.76
URL : https://hal.archives-ouvertes.fr/inria-00424025

Q. Jiang, Y. Luo, and D. Manivannan, An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems, Journal of Parallel and Distributed Computing, vol.68, issue.12, pp.1575-1589, 2008.
DOI : 10.1016/j.jpdc.2008.08.003

G. Stellner, CoCheck: checkpointing and process migration for MPI, Proceedings of International Conference on Parallel Processing, pp.526-531, 1996.
DOI : 10.1109/IPPS.1996.508106

S. Sankaran, J. M. Squyres, B. Barrett, and A. Lumsdaine, The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, Proceedings, LACSI Symposium, pp.479-493, 2003.
DOI : 10.1177/1094342005056139

A. Agbaria and R. Friedman, Starfish: fault-tolerant dynamic MPI programs on clusters of workstations, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469), p.31, 1999.
DOI : 10.1109/HPDC.1999.805295

A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier et al., MPICH-V2, Proceedings of the 2003 ACM/IEEE conference on Supercomputing, SC '03, p.25, 2003.
DOI : 10.1145/1048935.1050176

P. Sens and B. Folliot, The STAR fault manager for distributed operating environments: design, implementation and performance, Software: Practice and Experience, vol.28, issue.10, pp.1079-1099, 1998.
DOI : 10.1002/(SICI)1097-024X(199808)28:10<1079::AID-SPE199>3.0.CO;2-D
URL : https://hal.archives-ouvertes.fr/hal-01198740

A. Ciuffoletti, A. Congiusta, G. Jankowski, M. Jankowski, N. Meyer et al., Grid Infrastructure Architecture: A Modular Approach from CoreGRID, CoreGRID Project, 2007.
DOI : 10.1007/978-3-540-68262-2_6

J. Mehnert-Spahn, E. Feller, and M. Schoettner, Incremental checkpointing for grids, Linux Symposium, p.201, 2009.