A. Avizienis, J. Laprie, B. Randell, and C. E. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Trans. Dependable Sec. Comput, vol.1, issue.1, pp.11-33, 2004.

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Which verification for soft error detection?, HiPC, 2015.

L. Bautista-Gomez and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, PPoPP, 2014.

L. Bautista-Gomez and F. Cappello, Detecting and correcting data corruption in stencil applications through multivariate interpolation, FTS. IEEE, 2015.

L. Bautista-Gomez and F. Cappello, Exploiting spatial smoothness in HPC applications to detect silent data corruption, 2015.

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Identifying the right replication level to detect and correct silent errors at scale, Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop. ACM, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082907

A. Benoit, A. Cavelan, V. L. Fèvre, Y. Robert, and H. Sun, Towards optimal multi-level checkpointing, IEEE Transactions on Computers, vol.66, issue.7, pp.1212-1226, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082416

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal resilience patterns to cope with fail-stop and silent errors, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01215857

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, Int. J. High Performance Computing Applications, 2014.

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight silent data corruption detection based on runtime data analysis for HPC applications, 2015.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, issue.4, pp.410-416, 2009.

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, ICS. ACM, 2008.

F. Cappello, E. M. Constantinescu, P. D. Hovland, T. Peterka, C. Phillips et al., Improving the trust in results of numerical simulations and scientific data analytics, 2015.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, Int. J. High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

H. Casanova, M. Bougeret, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00881463

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Comp. Syst, vol.51, pp.7-19, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01199752

P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, RAID: High-performance, reliable secondary storage, ACM Comput. Surv, vol.26, issue.2, pp.145-185, 1994.

E. Ciocca, I. Koren, Z. Koren, C. M. Krishna, and D. S. Katz, Application-level fault tolerance in the orbital thermal imaging spectrometer, 2004.

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multi-level checkpoint model for large scale HPC applications, 2014.

J. Dongarra, The international exascale software project roadmap, Int. J. High Perform. Comput. Appl, vol.25, issue.1, pp.3-60, 2011.

J. Elliott, M. Hoemmen, and F. Mueller, Evaluating the impact of SDC on the GMRES iterative solver, IPDPS. IEEE, 2014.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, ICDCS, 2012.

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

C. Engelmann and S. Böhm, Redundant execution of HPC applications with MR-MPI, PDCN. IASTED, 2011.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, SC'11, 2011.

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, 2012.

C. George and S. S. Vadhiyar, ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability. Research report SAND2011-3915 C, Sandia Nat. Lab, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984.

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

M. Li and P. P. Lee, Toward I/O-efficient protection against silent data corruptions in RAID arrays, 30th Symposium on Mass Storage Systems and Technologies (MSST), pp.1-12, 2014.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013.

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994.

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the impact of checkpoints on next-generation systems, 24th IEEE Conf. Mass Storage Systems and Technologies, 2007.

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008.

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, ScalA '13, 2013.

B. Schroeder and G. A. Gibson, Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, ICS, 2012.

M. Snir, Addressing failures in exascale computing, Int. J. High Perform. Comput. Appl, vol.28, issue.2, pp.129-173, 2014.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, FTXS, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational grids, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics, IBM J. Res. Dev, vol.40, issue.1, pp.3-18, 1996.