A. Avizienis, J. Laprie, B. Randell, and C. E. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004.
DOI : 10.1109/TDSC.2004.2

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Which verification for soft error detection? In HiPC, 2015.

L. Bautista-Gomez and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, PPoPP. ACM, 2014.

L. Bautista-Gomez and F. Cappello, Detecting and correcting data corruption in stencil applications through multivariate interpolation, FTS. IEEE, 2015.

L. Bautista-Gomez and F. Cappello, Exploiting spatial smoothness in HPC applications to detect silent data corruption, HPCC. IEEE, 2015.

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, PMBS. ACM, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01215857

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, The International Journal of High Performance Computing Applications, 2014.

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015.
DOI : 10.1145/1810085.1810120

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, 2008.
DOI : 10.1145/1375527.1375552

F. Cappello, E. M. Constantinescu, P. D. Hovland, T. Peterka, C. Phillips et al., Improving the trust in results of numerical simulations and scientific data analytics, 2015.
DOI : 10.2172/1179023

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1515/9781400882618-003
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.29, issue.2, 2014.
DOI : 10.1515/9781400882618-003
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

H. Casanova, M. Bougeret, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00668016

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
DOI : 10.1016/j.future.2015.04.003
URL : https://hal.archives-ouvertes.fr/hal-01199752

E. Ciocca, I. Koren, Z. Koren, C. M. Krishna, and D. S. Katz, Application-level fault tolerance in the orbital thermal imaging spectrometer, 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2004.
DOI : 10.1109/PRDC.2004.1276551

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming Models and Development Software for a Space-Based Many-Core Processor, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology, pp.95-102, 2011.
DOI : 10.1109/SMC-IT.2011.29

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.122

J. Dongarra, The International Exascale Software Project roadmap, The International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011.
DOI : 10.2172/471364

J. Elliott, M. Hoemmen, and F. Mueller, Evaluating the Impact of SDC on the GMRES Iterative Solver, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.123

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, PDCN. IASTED, 2009.

C. Engelmann and S. Böhm, Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks (PDCN), 2011.
DOI : 10.2316/P.2011.719-031

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, SC. ACM, 2012.

C. George and S. S. Vadhiyar, ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
DOI : 10.1016/j.procs.2012.04.018

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

T. LeBlanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.
DOI : 10.1007/978-3-540-30218-6_19

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, SC. ACM, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic checkpoint/restart for soft and hard error protection, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. J. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), 2007.
DOI : 10.1109/MSST.2007.4367962

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pp.393-404, 2008.
DOI : 10.1109/HPCA.2008.4658655

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, 2012.
DOI : 10.1145/2304576.2304588

M. Snir et al., Addressing failures in exascale computing, The International Journal of High Performance Computing Applications, vol.28, issue.2, pp.129-173, 2014.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off? In FTXS, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed partial redundancy for resilient HPC, Proceedings of the 12th ACM International Conference on Computing Frontiers, CF '15, 2015.
DOI : 10.1145/1321211.1321241

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, ICCIT. IEEE, 2011.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics (1978-1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996.
DOI : 10.1147/rd.401.0003