J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: A Call to Cooperative Action by the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

V. Sarkar, Exascale software study: Software challenges in extreme scale systems, 2009.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

J. Shalf, S. Dosanjh, and J. Morrison, Exascale Computing Technology Challenges, VECPAR'10, the 9th Int. Conf. High Performance Computing for Computational Science, pp.1-25, 2011.
DOI : 10.1109/MM.2009.5
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.185.3897

E. Meneses, O. Sarood, and L. V. Kalé, Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012.
DOI : 10.1109/SBAC-PAD.2012.12

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

G. Zheng, X. Ni, and L. V. Kalé, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

F. Cappello, H. Casanova, and Y. Robert, Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126
URL : https://hal.archives-ouvertes.fr/hal-00945068

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Proc. 2004 IEEE Int. Conf. Cluster Computing, 2004.

X. Ni, E. Meneses, and L. V. Kalé, Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, 2012.
DOI : 10.1109/CLUSTER.2012.82

J. Dongarra, T. Hérault, and Y. Robert, Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013.
DOI : 10.1109/IPDPSW.2013.11
URL : https://hal.archives-ouvertes.fr/hal-00925168

R. Rajachandrasekar, A. Moody, K. Mohror, and D. K. Panda, A 1 PB/s file system to checkpoint three million MPI tasks, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.143-154, 2013.
DOI : 10.1145/2493123.2462908

Research Report N° 8387, Research Centre Grenoble - Rhône-Alpes, Inovallée, 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier Cedex. Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex, inria.fr. ISSN 0249-6399.