J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM, pp.1-11, 2010.

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed, 3rd Workshop for Fault-tolerance at Extreme Scale (FTXS), 2013.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

Y. Ling, J. Mi, and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Transactions on Computers, pp.699-708, 2001.

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, PPAM, ser. LNCS, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926

S. Toueg and O. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

M. Bouguerra, D. Trystram, and F. Wagner, Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101

J. S. Plank and M. G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, p.1590, 2001.
DOI : 10.1006/jpdc.2001.1757

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010.
DOI : 10.1109/ICPP.2010.80

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

F. Cappello, H. Casanova, and Y. Robert, Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126
URL : https://hal.archives-ouvertes.fr/hal-00945068

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Sandia National Laboratories, 2011.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Characterizing the impact of soft errors on iterative methods in scientific computing, Proceedings of the international conference on Supercomputing, ICS '11, pp.152-161, 2011.
DOI : 10.1145/1995896.1995922

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.33, issue.6, pp.518-528, 1984.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM, 2012.

Inria Research Report N° 8319. Research Centre Grenoble - Rhône-Alpes, Inovallée, 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier Cedex. Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105, 78153 Le Chesnay Cedex. ISSN 0249-6399.