C. Argyrides, C. A. Lisboa, D. K. Pradhan, and L. Carro, A fast error correction technique for matrix multiplication algorithms, 15th IEEE International On-Line Testing Symposium, pp.133-137, 2009.

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Coping with silent and fail-stop errors at scale by combining replication and checkpointing, J. Parallel Distrib. Comput, vol.122, pp.209-225, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02082389

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, pp.410-416, 2009.

A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. J. Dongarra, Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy, ACM Trans. Parallel Comput, vol.1, issue.2, p.28, 2015.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

Z. Chen and J. Dongarra, Algorithm-based checkpoint-free fault tolerance for parallel matrix multiplications on volatile resources, Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, 2006.

Z. Chen and J. J. Dongarra, Condition numbers of Gaussian random matrices, SIAM Journal on Matrix Analysis and Applications, vol.27, issue.3, pp.603-620, 2005.

Z. Chen and J. J. Dongarra, Numerically stable real number codes based on random matrices, vol.3514, 2005.

M. Fasi, J. Langou, Y. Robert, and B. Uçar, A backward/forward recovery approach for the preconditioned conjugate gradient method, J. Computational Science, vol.17, pp.522-534, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01354682

J. Gunnels, D. Katz, E. Quintana-Ortí, and R. van de Geijn, Fault-tolerant high-performance matrix multiplication: Theory and practice, Proceedings of the International Conference on Dependable Systems and Networks, pp.47-56, 2001.

T. Herault and Y. Robert (eds.), Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

N. J. Higham and T. Mary, A new approach to probabilistic rounding error analysis, SIAM Journal on Scientific Computing, vol.41, issue.5, pp.2815-2835, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02311269

K. Huang and J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. on Comp. (Spec. Issue Reliable & Fault-Tolerant Comp.), vol.33, pp.518-528, 1984.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013.

J. S. Plank, A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems, Software: Practice & Experience, vol.27, issue.9, pp.995-1012, 1997.

P. Prata and J. G. Silva, Algorithm based fault tolerance versus result-checking for matrix computations, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352), pp.4-11, 1999.


P. Prata and J. G. Silva, Fault-detection by result-checking for the eigenproblem, Dependable Computing - EDCC-3, pp.419-436, 1999.

I. S. Reed and G. Solomon, Polynomial codes over certain finite fields, Journal of the Society for Industrial and Applied Mathematics, vol.8, issue.2, pp.300-304, 1960.

A. Roy-Chowdhury and P. Banerjee, Algorithm-based fault location and recovery for matrix computations on multiprocessor systems, IEEE Transactions on Computers, vol.45, issue.11, 1996.

T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, and E. S. Quintana-Ortí, FLAME Working Note, vol.76, 2015.

P. Wu, Q. Guan, N. Debardeleben, S. Blanchard, D. Tao et al., Towards practical algorithm based fault tolerance in dense linear algebra, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp.31-42, 2016.