M. Fasi, Y. Robert, and B. Uçar, Combining Backward and Forward Recovery to Cope with Silent Errors in Iterative Solvers, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pp.980-989, 2015.
DOI : 10.1109/IPDPSW.2015.22
URL : https://hal.archives-ouvertes.fr/hal-01159679

A. Moody, G. Bronevetsky, K. Mohror, and B. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp.1-11, 2010.
DOI : 10.2172/984082

F. Cappello, Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities, International Journal of High Performance Computing Applications, vol.23, issue.3, pp.212-226, 2009.
DOI : 10.1177/1094342009106189

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, open Access, p.14
DOI : 10.1177/1094342009347767

B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, p.012022, 2007.
DOI : 10.1088/1742-6596/78/1/012022

Z. Chen, Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, Proc. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.167-176, 2013.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, Performance Modeling, Benchmarking and Simulation (PMBS), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01358146

K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers, vol.C-33, issue.6, pp.518-528, 1984.

P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, Algorithm-based fault tolerance for dense matrix factorizations, pp.225-234, 2012.
DOI : 10.1145/2370036.2145845
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.498.9892

E. Yao, J. Zhang, M. Chen, G. Tan, and N. Sun, Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.422-436, 2015.
DOI : 10.1177/1094342015578487

D. Hakkarinen, P. Wu, and Z. Chen, Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition, IEEE Transactions on Parallel and Distributed Systems, vol.26, issue.5, pp.1323-1335, 2015.
DOI : 10.1109/tpds.2014.2320502

P. Du, P. Luszczek, S. Tomov, and J. Dongarra, Soft error resilient QR factorization for hybrid system with GPGPU, Scalable Algorithms for Large-Scale Systems Workshop (ScalA2011), pp.457-464, 2011.
DOI : 10.1016/j.jocs.2013.01.004

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

M. Benzi and M. Tůma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM Journal on Scientific Computing, vol.19, issue.3, pp.968-994, 1998.
DOI : 10.1137/S1064827595294691

K. Kaya, B. Uçar, and Ü. V. Çatalyürek, Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication, Parallel Processing and Applied Mathematics (PPAM2014), pp.174-184, 2014.
DOI : 10.1007/978-3-642-55195-6_16
URL : https://hal.archives-ouvertes.fr/hal-00821523

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM/IEEE SC Int. Conf., SC '12, 2012.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013.
DOI : 10.1145/2465813.2465821

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

M. Hoemmen and M. A. Heroux, Fault-tolerant iterative methods via selective reliability, Tech. rep., Sandia Corporation, 2011.

M. Hoemmen and M. A. Heroux, Fault-Tolerant Iterative Methods via Selective Reliability, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), p.9, 2011.

P. G. Bridges, K. B. Ferreira, M. A. Heroux, and M. Hoemmen, Fault-tolerant linear solvers via selective reliability, preprint, 2012.

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4
DOI : 10.1177/1094342014532297

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

C. Anfinson and F. Luk, A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988.
DOI : 10.1109/12.9736

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

J. Sloan, R. Kumar, and G. Bronevetsky, Algorithmic approaches to low overhead fault detection for sparse linear algebra, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp.1-12, 2012.
DOI : 10.1109/DSN.2012.6263938

Y. Saad, Iterative Methods for Sparse Linear Systems, 2003.
DOI : 10.1137/1.9780898718003

F. R. Chung, Spectral Graph Theory, 1997.
DOI : 10.1090/cbms/092

A. Edelman, Eigenvalues and Condition Numbers of Random Matrices, SIAM Journal on Matrix Analysis and Applications, vol.9, issue.4, pp.543-560, 1988.
DOI : 10.1137/0609045

Z. Chen and J. J. Dongarra, Condition Numbers of Gaussian Random Matrices, SIAM Journal on Matrix Analysis and Applications, vol.27, issue.3, pp.603-620, 2005.
DOI : 10.1137/040616413

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603

T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, vol.38, issue.1, pp.1-1, 2011.
DOI : 10.1145/2049662.2049663

M. Benzi, R. Kouhia, and M. Tůma, Stabilized and block approximate inverse preconditioners for problems in solid and structural mechanics, Computer Methods in Applied Mechanics and Engineering, vol.190, issue.49-50, pp.49-50, 2001.
DOI : 10.1016/S0045-7825(01)00235-3

N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2002.
DOI : 10.1137/1.9780898718027

M. Fasi, Y. Robert, and B. Uçar, Combining Algorithm-based Fault Tolerance and Checkpointing for Iterative Solvers, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01111707

N. J. Higham, Functions of Matrices: Theory and Computation, 2008.
DOI : 10.1137/1.9780898717778

J. Elliott, F. Mueller, M. Stoyanov, and C. Webster, Quantifying the impact of single bit flips on floating point arithmetic, preprint, 2013.
DOI : 10.2172/1089338

M. Stoyanov and C. Webster, Quantifying the impact of single bit flips on floating point arithmetic, 2013.