J. Ansel, K. Arya, and G. Cooperman, DMTCP: Transparent checkpointing for cluster computations and the desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing
DOI : 10.1109/IPDPS.2009.5161063

K. Keahey, R. Figueiredo, J. Fortes, T. Freeman, and M. Tsugawa, Science clouds: Early experiences in cloud computing for scientific applications, Cloud computing and applications, pp.825-830, 2008.

D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman et al., The Eucalyptus Open-Source Cloud-Computing System, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pp.124-131, 2009.
DOI : 10.1109/CCGRID.2009.93

D. Milojičić, I. M. Llorente, and R. S. Montero, OpenNebula: A Cloud Management Tool, IEEE Internet Computing, vol.15, issue.2, pp.11-14, 2011.
DOI : 10.1109/MIC.2011.44

E. Feller, L. Rilling, and C. Morin, Snooze: A Scalable and Autonomic Virtual Machine Management Framework for Private Clouds, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), 2012.
DOI : 10.1109/CCGrid.2012.71
URL : https://hal.archives-ouvertes.fr/hal-00651542

P. Marshall, K. Keahey, and T. Freeman, Improving Utilization of Infrastructure Clouds, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.205-214, 2011.
DOI : 10.1109/CCGrid.2011.56

T. Wood, K. K. Ramakrishnan, P. Shenoy, and J. Van der Merwe, CloudNet, ACM SIGPLAN Notices, vol.46, issue.7, pp.121-132, 2011.
DOI : 10.1145/2007477.1952699

D. Ghoshal and L. Ramakrishnan, FRIEDA: Flexible Robust Intelligent Elastic Data Management in Cloud Environments, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.1096-1105, 2012.
DOI : 10.1109/SC.Companion.2012.132

S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn, Ceph: A scalable, high-performance distributed file system, Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp.307-320, 2006.

G. Cooperman, J. Ansel, and X. Ma, Adaptive checkpointing for master-worker style parallelism (extended abstract), Proc. of 2005 IEEE Computer Society International Conference on Cluster Computing, 2005.

P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, Zookeeper: Wait-free coordination for internet-scale systems, USENIX Annual Technical Conference, p.9, 2010.

Restlet, RESTful web framework for Java. http://www.restlet.org.

The Grid'5000 experimentation testbed, 2013.

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter et al., The NAS Parallel Benchmarks, International Journal of High Performance Computing Applications, vol.5, issue.3, pp.63-73, 1991.
DOI : 10.1177/109434209100500306

I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing, vol.65, issue.3, pp.1302-1326, 2013.
DOI : 10.1007/s11227-013-0884-0

J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007.
DOI : 10.1109/IPDPS.2007.370605

S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine et al., The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, International Journal of High Performance Computing Applications, vol.19, issue.4, pp.479-493, 2005.
DOI : 10.1177/1094342005056139

Q. Gao, W. Yu, W. Huang, and D. K. Panda, Application-transparent checkpoint/restart for MPI programs over InfiniBand, ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing, pp.471-478, 2006.

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI, International Journal of High Performance Computing Applications, vol.20, issue.3, pp.319-333, 2006.
DOI : 10.1177/1094342006067469
URL : https://hal.archives-ouvertes.fr/hal-00688637

P. Hargrove and J. Duell, Berkeley lab checkpoint/restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol.46, pp.494-499, 2006.
DOI : 10.1088/1742-6596/46/1/067

A. Tchana, L. Broto, and D. Hagimont, Approaches to cloud computing fault tolerance, 2012 International Conference on Computer, Information and Telecommunication Systems (CITS), pp.1-6, 2012.
DOI : 10.1109/CITS.2012.6220386

W. Zhao, P. Melliar-Smith, and L. Moser, Fault Tolerance Middleware for Cloud Computing, 2010 IEEE 3rd International Conference on Cloud Computing, pp.67-74, 2010.
DOI : 10.1109/CLOUD.2010.26

I. Egwutuoha, S. Chen, D. Levy, and B. Selic, A Fault Tolerance Framework for High Performance Computing in Cloud, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pp.709-710, 2012.
DOI : 10.1109/CCGrid.2012.80

S. Di, Y. Robert, F. Vivien, D. Kondo, C. Wang et al., Optimization of cloud task processing with checkpoint-restart mechanism, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13, pp.64:1-64:12, 2013.

B. Nicolae and F. Cappello, BlobCR, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.34:1-34:12, 2011.
DOI : 10.1145/2063384.2063429
URL : https://hal.archives-ouvertes.fr/inria-00601865

A. Kangarlou, P. Eugster, and D. Xu, VNsnap: Taking Snapshots of Virtual Networked Infrastructures in the Cloud, IEEE Transactions on Services Computing, pp.484-496, 2012.
DOI : 10.1109/TSC.2011.29

R. Garg, K. Sodha, Z. Jin, and G. Cooperman, Checkpoint-restart for a network of virtual machines, 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013.
DOI : 10.1109/CLUSTER.2013.6702626

B. Schroeder and G. Gibson, A large-scale study of failures in high-performance computing systems, IEEE Transactions on Dependable and Secure Computing, vol.7, issue.4, pp.337-350, 2010.

N. Xiong, A. Vasilakos, J. Wu, Y. Yang, A. Rindos et al., A Self-tuning Failure Detection Scheme for Cloud Computing Service, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.668-679, 2012.
DOI : 10.1109/IPDPS.2012.126

T. Ropars, E. Jeanvoine, and C. Morin, GAMoSe: An Accurate Monitoring Service For Grid Applications, Sixth International Symposium on Parallel and Distributed Computing (ISPDC'07), pp.40-40, 2007.
DOI : 10.1109/ISPDC.2007.23
URL : https://hal.archives-ouvertes.fr/inria-00424023