J. Ansel, K. Arya, and G. Cooperman, DMTCP: Transparent checkpointing for cluster computations and the desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing, pp.1-12, 2009.
DOI : 10.1109/IPDPS.2009.5161063

J. Cao, G. Kerr, K. Arya, and G. Cooperman, Transparent checkpoint-restart over InfiniBand, ACM Symposium on High Performance Parallel and Distributed Computing, 2014.
DOI : 10.1145/2600212.2600219

URL : http://arxiv.org/abs/1312.3938

K. Keahey, R. Figueiredo, J. Fortes, T. Freeman, and M. Tsugawa, Science clouds: Early experiences in cloud computing for scientific applications, Cloud computing and applications, pp.825-830, 2008.

D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman et al., The Eucalyptus Open-Source Cloud-Computing System, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pp.124-131, 2009.
DOI : 10.1109/CCGRID.2009.93

D. Milojičić, I. M. Llorente, and R. S. Montero, OpenNebula: A Cloud Management Tool, IEEE Internet Computing, vol.15, issue.2, pp.11-14, 2011.
DOI : 10.1109/MIC.2011.44

E. Feller, L. Rilling, and C. Morin, Snooze: A Scalable and Autonomic Virtual Machine Management Framework for Private Clouds, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), 2012.
DOI : 10.1109/CCGrid.2012.71

URL : https://hal.archives-ouvertes.fr/hal-00651542

P. Marshall, K. Keahey, and T. Freeman, Improving Utilization of Infrastructure Clouds, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.205-214, 2011.
DOI : 10.1109/CCGrid.2011.56

T. Wood, K. K. Ramakrishnan, P. Shenoy, and J. Van der Merwe, CloudNet, ACM SIGPLAN Notices, vol.46, issue.7, pp.121-132, 2011.
DOI : 10.1145/2007477.1952699

D. Ghoshal and L. Ramakrishnan, FRIEDA: Flexible Robust Intelligent Elastic Data Management in Cloud Environments, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.1096-1105, 2012.
DOI : 10.1109/SC.Companion.2012.132

S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn, Ceph: A scalable, high-performance distributed file system, Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp.307-320, 2006.

G. Cooperman, J. Ansel, and X. Ma, Adaptive checkpointing for master-worker style parallelism (extended abstract), Proc. of 2005 IEEE Computer Society International Conference on Cluster Computing, 2005.

P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, Zookeeper: Wait-free coordination for internet-scale systems, USENIX Annual Technical Conference, p.9, 2010.

Restlet, RESTful web framework for Java

R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, and R. S. Schreiber, The NAS parallel benchmarks, International Journal of High Performance Computing Applications, vol.5, issue.3, pp.63-73, 1991.

I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing, vol.65, issue.3, pp.1302-1326, 2013.
DOI : 10.1007/s11227-013-0884-0

J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007.
DOI : 10.1109/IPDPS.2007.370605

S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine et al., The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, International Journal of High Performance Computing Applications, vol.19, issue.4, pp.479-493, 2005.
DOI : 10.1177/1094342005056139

Q. Gao, W. Yu, W. Huang, and D. K. Panda, Application-transparent checkpoint/restart for MPI programs over InfiniBand, ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing, pp.471-478, 2006.

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI, International Journal of High Performance Computing Applications, vol.20, issue.3, pp.319-333, 2006.
DOI : 10.1177/1094342006067469

URL : https://hal.archives-ouvertes.fr/hal-00688637

P. Hargrove and J. Duell, Berkeley lab checkpoint/restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol.46, pp.494-499, 2006.
DOI : 10.1088/1742-6596/46/1/067

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

A. Tchana, L. Broto, and D. Hagimont, Approaches to cloud computing fault tolerance, 2012 International Conference on Computer, Information and Telecommunication Systems (CITS), pp.1-6, 2012.
DOI : 10.1109/CITS.2012.6220386

W. Zhao, P. Melliar-Smith, and L. Moser, Fault Tolerance Middleware for Cloud Computing, 2010 IEEE 3rd International Conference on Cloud Computing, pp.67-74, 2010.
DOI : 10.1109/CLOUD.2010.26

I. Egwutuoha, S. Chen, D. Levy, and B. Selic, A Fault Tolerance Framework for High Performance Computing in Cloud, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pp.709-710, 2012.
DOI : 10.1109/CCGrid.2012.80

S. Di, Y. Robert, F. Vivien, D. Kondo, C. Wang et al., Optimization of cloud task processing with checkpoint-restart mechanism, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13, pp.64:1-64:12, 2013.

B. Nicolae and F. Cappello, BlobCR, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.34:1-34:12, 2011.
DOI : 10.1145/2063384.2063429

URL : https://hal.archives-ouvertes.fr/inria-00601865

A. Kangarlou, P. Eugster, and D. Xu, VNsnap: Taking Snapshots of Virtual Networked Infrastructures in the Cloud, IEEE Transactions on Services Computing, pp.484-496, 2012.
DOI : 10.1109/TSC.2011.29

R. Garg, K. Sodha, Z. Jin, and G. Cooperman, Checkpoint-restart for a network of virtual machines, 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013.
DOI : 10.1109/CLUSTER.2013.6702626

B. Schroeder and G. Gibson, A large-scale study of failures in high-performance computing systems, IEEE Transactions on Dependable and Secure Computing, vol.7, issue.4, pp.337-350, 2010.

N. Xiong, A. Vasilakos, J. Wu, Y. Yang, A. Rindos et al., A Self-tuning Failure Detection Scheme for Cloud Computing Service, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.668-679, 2012.
DOI : 10.1109/IPDPS.2012.126

T. Ropars, E. Jeanvoine, and C. Morin, GAMoSe: An Accurate Monitoring Service For Grid Applications, Sixth International Symposium on Parallel and Distributed Computing (ISPDC'07), pp.40-40, 2007.
DOI : 10.1109/ISPDC.2007.23

URL : https://hal.archives-ouvertes.fr/inria-00424023