E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon, Towards resilient parallel linear Krylov solvers: recover-restart strategies, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00843992

E. Agullo, L. Giraud, P. Salas, and M. Zounon, On resiliency in some parallel eigensolvers, Research Report, vol.8625, 2015.

E. Agullo, L. Giraud, and M. Zounon, On the Resilience of Parallel Sparse Hybrid Solvers, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), pp.75-84, 2015.
DOI : 10.1109/HiPC.2015.9
URL : https://hal.archives-ouvertes.fr/hal-01256316

L. Alvisi and K. Marzullo, Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, vol.24, issue.2, pp.149-159, 1998.
DOI : 10.1109/32.666828

J. Anfinson and F. T. Luk, A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988.
DOI : 10.1109/12.9736

W. E. Arnoldi, The principle of minimized iterations in the solution of the matrix eigenvalue problem, Quarterly of Applied Mathematics, vol.9, issue.1, pp.17-29, 1951.
DOI : 10.1090/qam/42792

T. M. Austin, DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design, Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 32, pp.196-207, 1999.

A. Borg, J. Baumbach, and S. Glazer, A message system supporting fault tolerance, ACM SIGOPS Operating Systems Review, vol.17, issue.5, pp.90-99, 1983.
DOI : 10.1145/773379.806617

F. Cappello, H. Casanova, and Y. Robert, Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, pp.111-132, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00945068

Z. Chen, Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, ACM SIGPLAN Notices, vol.48, pp.167-176, 2013.

T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, pp.1:1-1:25, 2011.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A Survey of Rollback-recovery Protocols in Message-passing Systems, ACM Comput. Surv, vol.34, issue.3, pp.375-408, 2002.

E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel, The performance of consistent checkpointing, [1992] Proceedings 11th Symposium on Reliable Distributed Systems, pp.39-47, 1992.
DOI : 10.1109/RELDIS.1992.235144

D. R. Fokkema, G. L. G. Sleijpen, and H. A. van der Vorst, Jacobi-Davidson style QR and QZ algorithms for the partial reduction of matrix pencils, SIAM J. Sci. Comput, vol.20, pp.94-125, 1996.

J. A. Gunnels, R. A. van de Geijn, D. S. Katz, and E. S. Quintana-Ortí, Fault-tolerant high-performance matrix multiplication: Theory and practice, Dependable Systems and Networks, pp.47-56, 2001.

G. W. Stewart, Matrix Algorithms, Volume II: Eigensystems, SIAM, 2001.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, pp.518-528, 1984.

R. K. Iyer, N. M. Nakka, Z. T. Kalbarczyk, and S. Mitra, Recent Advances and New Avenues in Hardware-Level Reliability Support, IEEE Micro, vol.25, issue.6, pp.2518-2547, 2005.
DOI : 10.1109/MM.2005.119

L. Jaulmes, M. Casas, M. Moretó, E. Ayguade, J. Labarta et al., Exploiting asynchrony from exact forward recovery for DUE in iterative solvers, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, 2015.
DOI : 10.1145/2807591.2807599

D. B. Johnson and W. Zwaenepoel, Sender-based message logging, 1987.

J. Langou, Z. Chen, G. Bosilca, and J. Dongarra, Recovery Patterns for Iterative Methods in a Parallel Unstable Environment, SIAM Journal on Scientific Computing, vol.30, issue.1, pp.102-116, 2007.
DOI : 10.1137/040620394

R. B. Lehoucq and D. C. Sorensen, Deflation Techniques for an Implicitly Restarted Arnoldi Iteration, SIAM Journal on Matrix Analysis and Applications, vol.17, issue.4, pp.789-821, 1996.
DOI : 10.1137/S0895479895281484

C. J. Li and W. K. Fuchs, CATCH: compiler-assisted techniques for checkpointing, Digest of Papers, 20th International Symposium on Fault-Tolerant Computing (FTCS-20), pp.74-81, 1990.

Y. Liu, R. Nassar, C. B. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IEEE International Symposium on Parallel and Distributed Processing, pp.1-9, 2008.

N. Oh, P. P. Shirvani, and E. J. McCluskey, Error detection by duplicated instructions in super-scalar processors, IEEE Transactions on Reliability, vol.51, issue.1, pp.63-75, 2002.

J. S. Plank, Y. Kim, and J. Dongarra, Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing, Journal of Parallel and Distributed Computing, vol.43, issue.2, pp.125-138, 1997.
DOI : 10.1006/jpdc.1997.1336

J. Plank, An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance, 1997.

J. S. Plank and K. Li, ICKP: a consistent checkpointer for multicomputers, IEEE Parallel & Distributed Technology: Systems & Applications, vol.2, issue.2, pp.62-67, 1994.

N. R. Gottumukkala, Y. Liu, C. B. Leangsuksun, R. Nassar et al., Reliability Analysis in HPC clusters, Proceedings of the High Availability and Performance Computing Workshop, 2006.

Y. Saad, Numerical Methods for Large Eigenvalue Problems, 1992.
DOI : 10.1137/1.9781611970739

P. Salas, Physical and numerical aspects of thermoacoustic instabilities in annular combustion chambers, 2013.
URL : https://hal.archives-ouvertes.fr/tel-00937020

P. Salas, L. Giraud, Y. Saad, and S. Moreau, Spectral recycling strategies for the solution of nonlinear eigenproblems in thermoacoustics, Numerical Linear Algebra with Applications, pp.1039-1058, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01238263

J. C. Sancho, F. Petrini, K. Davis, R. Gioiosa, and S. Jiang, Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance, 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
DOI : 10.1109/IPDPS.2005.157

M. Scholzel, Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors, Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007, pp.21-26, 2007.
DOI : 10.1109/SPA.2007.5903294

G. L. G. Sleijpen and H. A. van der Vorst, A Jacobi-Davidson iteration method for linear eigenvalue problems, SIAM Rev, vol.42, issue.2, pp.267-293, 2000.

T. N. Vijaykumar, I. Pomeranz, and K. Cheng, Transient-fault recovery using simultaneous multithreading, Proceedings of the 29th Annual International Symposium on Computer Architecture, pp.87-98, 2002.

J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unreliable components, Automata Studies, pp.43-98, 1956.

C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, Hybrid full/incremental checkpoint/restart for MPI jobs in HPC environments, 2009.

C. Weaver and T. M. Austin, A fault tolerant approach to microprocessor design, Proceedings International Conference on Dependable Systems and Networks, pp.411-420, 2001.
DOI : 10.1109/DSN.2001.941425

W. Weibull, A statistical distribution function of wide applicability, Journal of Applied Mechanics, vol.18, pp.293-297, 1951.

M. Zounon, On numerical resilience in linear algebra, PhD thesis, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01231838