C. J. Anfinson and F. T. Luk, A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988.
DOI : 10.1109/12.9736

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, Workshop on Performance Modeling, Benchmarking and Simulation (PMBS), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01358146

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 2015.
DOI : 10.1177/1094342014532297

M. Benzi, R. Kouhia, and M. Tůma, Stabilized and block approximate inverse preconditioners for problems in solid and structural mechanics, Computer Methods in Applied Mechanics and Engineering, vol.190, issue.49-50, pp.6533-6554, 2001.
DOI : 10.1016/S0045-7825(01)00235-3

M. Benzi and M. Tůma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM Journal on Scientific Computing, vol.19, issue.3, pp.968-994, 1998.
DOI : 10.1137/S1064827595294691

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

P. Bridges, K. B. Ferreira, M. A. Heroux, and M. Hoemmen, Fault-tolerant linear solvers via selective reliability, preprint, 2012.

G. Bronevetsky and B. R. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

F. Cappello, Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities, International Journal of High Performance Computing Applications, vol.23, issue.3, pp.212-226, 2009.
DOI : 10.1177/1094342009106189

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, Supercomputing Frontiers and Innovations, vol.1, issue.1, 2014.

Z. Chen, Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, Proc. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.167-176, 2013.

F. R. K. Chung, Spectral Graph Theory, 1997.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, vol.38, issue.1, pp.1:1-1:25, 2011.
DOI : 10.1145/2049662.2049663

P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, Algorithm-based fault tolerance for dense matrix factorizations, PPoPP, pp.225-234, 2012.

P. Du, P. Luszczek, S. Tomov, and J. Dongarra, Soft error resilient QR factorization for hybrid system with GPGPU, Scalable Algorithms for Large-Scale Systems Workshop (ScalA2011), pp.457-464, 2011.
DOI : 10.1016/j.jocs.2013.01.004

J. Elliott, F. Mueller, M. Stoyanov, and C. Webster, Quantifying the impact of single bit flips on floating point arithmetic, preprint, 2013.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56

M. Fasi, Y. Robert, and B. Uçar, Combining Algorithm-based Fault Tolerance and Checkpointing for Iterative Solvers, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01111707

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM/IEEE SC Int. Conf., SC '12, 2012.

D. Hakkarinen, P. Wu, and Z. Chen, Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition, IEEE Transactions on Parallel and Distributed Systems, vol.26, issue.5, pp.1323-1335, 2015.

M. A. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2002.
DOI : 10.1137/1.9780898718027

N. J. Higham, Functions of Matrices: Theory and Computation, 2008.
DOI : 10.1137/1.9780898717778

M. Hoemmen and M. A. Heroux, Fault-tolerant iterative methods via selective reliability, 2011.

M. Hoemmen and M. A. Heroux, Fault-Tolerant Iterative Methods via Selective Reliability, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), p.9, 2011.

K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers, vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

K. Kaya, B. Uçar, and Ü. V. Çatalyürek, Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication, Parallel Processing and Applied Mathematics (PPAM2014), pp.174-184, 2014.
DOI : 10.1007/978-3-642-55195-6_16
URL : https://hal.archives-ouvertes.fr/hal-00821523

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013.
DOI : 10.1145/2465813.2465821

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp.1-11, 2010.

Y. Saad, Iterative Methods for Sparse Linear Systems, 2003.
DOI : 10.1137/1.9780898718003

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, p.012022, 2007.
DOI : 10.1088/1742-6596/78/1/012022

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

J. Sloan, R. Kumar, and G. Bronevetsky, Algorithmic approaches to low overhead fault detection for sparse linear algebra, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp.1-12, 2012.
DOI : 10.1109/DSN.2012.6263938

M. Stoyanov and C. Webster, Quantifying the impact of single bit flips on floating point arithmetic, 2013.

E. Yao, J. Zhang, M. Chen, G. Tan, and N. Sun, Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.422-436, 2015.
DOI : 10.1177/1094342015578487

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115