A. Avizienis, J. Laprie, B. Randell, and C. E. Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Trans. Dependable Sec. Comput, vol.1, pp.11-33, 2004.

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Identifying the right replication level to detect and correct silent errors at scale, 2017.
URL: https://hal.archives-ouvertes.fr/hal-02082907

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, pp.410-416, 2009.

R. Brightwell, K. Ferreira, and R. Riesen, Transparent Redundant Computing with MPI, 2010.

F. Cappello, E. M. Constantinescu, P. D. Hovland, T. Peterka, C. Phillips et al., Improving the trust in results of numerical simulations and scientific data analytics, 2015.

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, Int. J. High Performance Computing Applications, vol.23, pp.374-388, 2009.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, p.1, 2014.

H. Casanova, M. Bougeret, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, pp.210-224, 2014.
URL: https://hal.archives-ouvertes.fr/hal-00668016

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Comp. Syst, vol.51, pp.7-19, 2015.
URL: https://hal.archives-ouvertes.fr/hal-01199752

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multi-level checkpoint model for large scale HPC applications, 2014.

J. Dongarra, The International Exascale Software Project Roadmap, Int. J. High Perform. Comput. Appl, vol.25, pp.3-60, 2011.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, 2012.

E. Elnozahy and J. Plank, Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, pp.97-108, 2004.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

C. Engelmann and S. Böhm, Redundant execution of HPC applications with MR-MPI, 2011.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the Viability of Process Replication Reliability for Exascale Systems, Proc. SC'11, 2011.

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, p.78, 2012.

C. George and S. S. Vadhiyar, ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.

K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput, vol.33, pp.518-528, 1984.

Fault-Tolerance Techniques for High-Performance Computing, 2015.

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, pp.200-209, 1962.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013.

T. J. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, pp.553-557, 1994.

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conf. Mass Storage Systems and Technologies, 2007.

B. Schroeder and G. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, p.1, 2007.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault Tolerant Preconditioned Conjugate Gradient for Sparse Linear System Solution, ICS, 2012.

M. Snir, Addressing Failures in Exascale Computing, Int. J. High Perform. Comput. Appl, vol.28, pp.129-173, 2014.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed Partial Redundancy for Resilient HPC, Computing Frontiers, 2015.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using Replication and Checkpointing for Reliable Task Management in Computational Grids, 2010.
URL: https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, pp.530-531, 1974.

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM Experiments in Soft Fails in Computer Electronics, IBM J. Res. Dev, vol.40, pp.3-18, 1996.