G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien et al., On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.11-20, 2013.
DOI : 10.1109/PRDC.2013.10
URL : https://hal.archives-ouvertes.fr/hal-00836871

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Which Verification for Soft Error Detection?, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015.
DOI : 10.1109/HiPC.2015.26
URL : https://hal.archives-ouvertes.fr/hal-01252382

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, Proceedings of the 5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01358146

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01354886

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Two-level checkpointing and partial verifications for linear task graphs, Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01252400

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical timestepping schemes, The International Journal of High Performance Computing Applications, 2014.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Assessing the Impact of Partial Verifications against Silent Data Corruptions, 2015 44th International Conference on Parallel Processing, 2015.
DOI : 10.1109/ICPP.2015.53
URL : https://hal.archives-ouvertes.fr/hal-01253493

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of the 18th Symposium on Principles and Practice of Parallel Programming, pp.167-176, 2013.

F. R. Chung, Spectral Graph Theory, 1997.
DOI : 10.1090/cbms/092

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

J. Dongarra, The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012.
DOI : 10.1109/ICDCS.2012.56

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.

M. Fasi, J. Langou, Y. Robert, and B. Uçar, A backward/forward recovery approach for the preconditioned conjugate gradient method, Journal of Computational Science, vol.17
DOI : 10.1016/j.jocs.2016.04.008
URL : https://hal.archives-ouvertes.fr/hal-01354682

M. Fasi, Y. Robert, and B. Uçar, Combining Algorithm-based Fault Tolerance and Checkpointing for Iterative Solvers (short version appears in the proceedings of PDSEC), 2015.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, pp.49-56, 2013.
DOI : 10.1145/2465813.2465821

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

Y. Saad, Iterative methods for sparse linear systems, 2003.
DOI : 10.1137/1.9780898718003

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

B. Schroeder and G. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

S. Toueg and Ö. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

J. Ziegler, H. Muhlfeld, C. Montrose, H. Curtis, T. O'Gorman et al., Accelerated testing for cosmic soft-error rate, IBM Journal of Research and Development, vol.40, issue.1, pp.51-72, 1996.
DOI : 10.1147/rd.401.0051

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics (1978-1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996.
DOI : 10.1147/rd.401.0003