L. , B. Gomez, and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, SIGPLAN Notices, vol.49, issue.8, pp.381-382, 2014.

L. , B. Gomez, and F. Cappello, Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proc. FTS'15, 2015.

L. Bautista-gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-00721216

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39

URL : https://hal.archives-ouvertes.fr/hal-01354886

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, High Performance Computing Applications, 2014.
DOI : 10.1177/1094342014532297

E. Berrocal, L. Bautista-gomez, S. Di, Z. Lan, and F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015.
DOI : 10.1145/2749246.2749253

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

G. Bosilca, Unified model for assessing checkpointing protocols at extreme-scale. Concurrency and Computation: Practice and Experience, pp.2772-2791, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00696154

G. Bronevetsky and B. De-supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing , ICS '08, 2008.
DOI : 10.1145/1375527.1375552

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, p.2014
DOI : 10.1177/1094342009347767

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, PPoPP, 2013.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

S. Di, M. S. Bouguerra, L. Bautista-gomez, and F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.122

J. Dongarra, T. Hérault, and Y. Robert, Performance and reliability trade-offs for the double checkpointing algorithm, International Journal of Networking and Computing, vol.4, issue.1, pp.23-41, 2014.
DOI : 10.15803/ijnc.4.1_23

URL : https://hal.archives-ouvertes.fr/hal-01091928

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012.
DOI : 10.1109/ICDCS.2012.56

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.441-4412, 2011.
DOI : 10.1145/2063384.2063443

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012.

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984.

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, 2010.
DOI : 10.1109/ICPP.2010.80

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

A. Moody, G. Bronevetsky, K. Mohror, and B. R. Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proc. SC'10, pp.1-11, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. O. Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, issue.6, pp.198-202, 1998.
DOI : 10.1049/ip-sen:19982440

N. H. Vaidya, A case for two-level distributed recovery schemes, ACM SIGMETRICS Performance Evaluation Review, vol.23, issue.1, pp.64-73, 1995.
DOI : 10.1145/223586.223596

P. M. Widener, K. B. Ferreira, S. Levy, and N. Fabian, Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures, Euro-Par'15: Parallel Processing Workshops. Springer LNCS 9523, 2015.
DOI : 10.1007/978-3-319-27308-2_54

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

G. Zheng, L. Shi, and L. V. Kale, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Proc. CLUSTER'04, pp.93-103, 2004.

Z. Zheng, L. Yu, and Z. Lan, Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart, IEEE Transactions on Computers, vol.64, issue.5, pp.1402-1415, 2015.
DOI : 10.1109/TC.2014.2317182

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626

R. N°-8871 and R. Centre-grenoble-?-rhône-alpes, Inovallée 655 avenue de l'Europe Montbonnot 38334 Saint Ismier Cedex Publisher Inria Domaine de Voluceau -Rocquencourt BP 105 -78153 Le Chesnay Cedex inria, pp.249-6399