J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

V. Sarkar, Exascale software study: Software challenges in extreme scale systems, 2009.

S. Ashby, The opportunities and challenges of exascale computing, 2010, white paper available at: http://science.energy.gov

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, MPICH-V: a multiprotocol fault tolerant MPI, International Journal of High Performance Computing Applications, vol.20, issue.3, pp.319-333, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00688637

S. Rao, L. Alvisi, and H. M. Vin, Egida: an extensible toolkit for low-overhead fault-tolerance, Digest of Papers, Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, pp.48-55, 1999.
DOI : 10.1109/FTCS.1999.781033

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

S. Rao, L. Alvisi, and H. M. Vin, The cost of recovery in message logging protocols, 17th Symposium on Reliable Distributed Systems (SRDS), 1998.

A. Bouteiller, T. Herault, G. Bosilca, and J. J. Dongarra, Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proc. of Euro-Par'11 (II), ser. LNCS, pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6

A. Guermouche, T. Ropars, M. Snir, and F. Cappello, HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941

E. Meneses, C. L. Mendes, and L. V. Kalé, Team-based message logging: Preliminary results, Workshop Resilience in Clusters, Clouds, and Grids, 2010.
DOI : 10.1109/ccgrid.2010.110
URL : http://charm.cs.illinois.edu/newPapers/10-02/paper.pdf

J. S. Plank, Efficient Checkpointing on MIMD Architectures, 1993.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

Y. Ling, J. Mi, and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Transactions on Computers, pp.699-708, 2001.

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

J. S. Plank and M. G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, pp.1570-1590, 2001.
DOI : 10.1006/jpdc.2001.1757

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010.
DOI : 10.1109/ICPP.2010.80

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, pp.1-9, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

M. Bouguerra, D. Trystram, and F. Wagner, Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101

M. Wu, X. Sun, and H. Jin, Performance under failures of high-end computing, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC '07, 2007.
DOI : 10.1145/1362622.1362687

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, 2006.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS'08, pp.1-9, 2008.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444

L. G. Valiant, A bridging model for parallel computation, Communications of the ACM, vol.33, issue.8, pp.103-111, 1990.
DOI : 10.1145/79173.79181

A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the message logging model for high performance, Concurrency and Computation: Practice and Experience, pp.2196-2211, 2010.
DOI : 10.1002/cpe.1589

M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2012.
DOI : 10.1177/1094342013505348
URL : https://hal.archives-ouvertes.fr/hal-00881463

L. E. Cannon, A cellular computer to implement the Kalman filter algorithm, 1969.

S. Sumimoto, An Overview of Fujitsu's Lustre Based File System, Lustre Filesystem Users' Group Meeting, 2011.
