E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon, Towards resilient parallel linear Krylov solvers: recover-restart strategies, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00843992

E. Agullo, L. Giraud, P. Salas, and M. Zounon, On resiliency in some parallel eigensolvers, Research Report, vol.8625.

L. Alvisi and K. Marzullo, Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, vol.24, issue.2, pp.149-159, 1998.
DOI : 10.1109/32.666828

P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, vol.32, issue.2, pp.136-156, 2006.
DOI : 10.1016/j.parco.2005.07.004
URL : https://hal.archives-ouvertes.fr/hal-00358623

J. Anfinson and F. T. Luk, A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988.
DOI : 10.1109/12.9736

T. M. Austin, DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design, Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 32, pp.196-207, 1999.

A. Borg, J. Baumbach, and S. Glazer, A message system supporting fault tolerance, ACM SIGOPS Operating Systems Review, vol.17, issue.5, pp.90-99, 1983.
DOI : 10.1145/773379.806617

A. Buttari, Fine-Grained Multithreading for the Multifrontal $QR$ Factorization of Sparse Matrices, SIAM Journal on Scientific Computing, vol.35, issue.4, pp.323-345, 2013.
DOI : 10.1137/110846427
URL : https://hal.archives-ouvertes.fr/hal-01122471

F. Cappello, H. Casanova, and Y. Robert, Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, pp.111-132, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00945068

Z. Chen, Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, ACM SIGPLAN Notices, vol.48, pp.167-176, 2013.

T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, pp.1:1-1:25, 2011.

E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, A Survey of Rollback-recovery Protocols in Message-passing Systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.

E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel, The performance of consistent checkpointing, Proceedings 11th Symposium on Reliable Distributed Systems, pp.39-47, 1992.
DOI : 10.1109/RELDIS.1992.235144

A. George and E. Ng, On the Complexity of Sparse $QR$ and $LU$ Factorization of Finite-Element Matrices, SIAM Journal on Scientific and Statistical Computing, vol.9, issue.5, pp.849-861, 1988.
DOI : 10.1137/0909057

D. Göddeke, M. Altenbernd, and D. Ribbrock, Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing, Parallel Computing, vol.49, pp.117-135, 2015.
DOI : 10.1016/j.parco.2015.07.003

J. A. Gunnels, R. A. van de Geijn, D. S. Katz, and E. S. Quintana-Ortí, Fault-tolerant high-performance matrix multiplication: theory and practice, Proceedings International Conference on Dependable Systems and Networks, pp.47-56, 2001.
DOI : 10.1109/DSN.2001.941390

M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, vol.49, issue.6, pp.409-436, 1952.
DOI : 10.6028/jres.049.044

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.33, pp.518-528, 1984.

M. Huber, B. Gmeiner, U. Rüde, and B. I. Wohlmuth, Resilience for multigrid software at the extreme scale, 1506.

R. K. Iyer, N. M. Nakka, Z. T. Kalbarczyk, and S. Mitra, Recent Advances and New Avenues in Hardware-Level Reliability Support, IEEE Micro, vol.25, issue.6, pp.2518-2547, 2005.
DOI : 10.1109/MM.2005.119

D. B. Johnson and W. Zwaenepoel, Sender-based message logging, 1987.

J. Langou, Z. Chen, G. Bosilca, and J. Dongarra, Recovery Patterns for Iterative Methods in a Parallel Unstable Environment, SIAM Journal on Scientific Computing, vol.30, issue.1, pp.102-116, 2007.
DOI : 10.1137/040620394

C. J. Li and W. K. Fuchs, CATCH: compiler-assisted techniques for checkpointing, FTCS-20, 20th International Symposium on Fault-Tolerant Computing, Digest of Papers, pp.74-81, 1990.

Y. Liu, R. Nassar, C. B. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, Parallel and Distributed Processing, pp.1-910, 2008.

M. Casas, G. Bronevetsky, J. Labarta, and M. Valero, Dealing with faults in HPC systems, PMAA-International Workshop on Parallel Matrix Algorithms and Applications, 2014.

N. Oh, P. P. Shirvani, and E. J. McCluskey, Error detection by duplicated instructions in superscalar processors, IEEE Transactions on Reliability, vol.51, issue.1, pp.63-75, 2002.

C. C. Paige and M. A. Saunders, Solution of Sparse Indefinite Systems of Linear Equations, SIAM Journal on Numerical Analysis, vol.12, issue.4, pp.617-629, 1975.
DOI : 10.1137/0712047

J. S. Plank, Y. Kim, and J. Dongarra, Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing, Journal of Parallel and Distributed Computing, vol.43, issue.2, pp.125-138, 1997.
DOI : 10.1006/jpdc.1997.1336

J. Plank, An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance, 1997.

J. S. Plank and K. Li, ICKP: a consistent checkpointer for multicomputers, IEEE Parallel & Distributed Technology: Systems & Applications, vol.2, issue.2, pp.62-67, 1994.

N. R. Gottumukkala, Y. Liu, C. B. Leangsuksun, R. Nassar et al., Reliability Analysis in HPC clusters, Proceedings of the High Availability and Performance Computing Workshop, 2006.

Y. Saad, A Flexible Inner-Outer Preconditioned GMRES Algorithm, SIAM Journal on Scientific Computing, vol.14, issue.2, pp.461-469, 1993.
DOI : 10.1137/0914028

Y. Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, 2003.
DOI : 10.1137/1.9780898718003

Y. Saad and M. H. Schultz, GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems, SIAM Journal on Scientific and Statistical Computing, vol.7, issue.3, pp.856-869, 1986.
DOI : 10.1137/0907058

J. C. Sancho, F. Petrini, K. Davis, R. Gioiosa, and S. Jiang, Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance, 19th IEEE International Parallel and Distributed Processing Symposium, pp.10-1109, 2005.
DOI : 10.1109/IPDPS.2005.157

M. Schölzel, Reduced Triple Modular Redundancy for built-in self-repair in VLIW-processors, Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007, pp.21-26, 2007.
DOI : 10.1109/SPA.2007.5903294

B. F. Smith, P. E. Bjørstad, and W. D. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, 1996.

K. Teranishi and M. A. Heroux, Toward Local Failure Local Recovery Resilience Model using MPI-ULFM, Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, pp.51-51, 2014.
DOI : 10.1145/2642769.2642774

T. N. Vijaykumar, I. Pomeranz, and K. Cheng, Transient-fault recovery using simultaneous multithreading, Proceedings of the 29th Annual International Symposium on Computer Architecture, pp.87-98, 2002.

J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unreliable components, Automata Studies, pp.43-98, 1956.

C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, Hybrid full/incremental checkpoint/restart for MPI jobs in HPC environments, 2009.

C. Weaver and T. M. Austin, A fault tolerant approach to microprocessor design, Proceedings International Conference on Dependable Systems and Networks, pp.411-420, 2001.
DOI : 10.1109/DSN.2001.941425

W. Weibull, A statistical distribution function of wide applicability, Journal of Applied Mechanics, vol.18, pp.293-297, 1951.