A fast error correction technique for matrix multiplication algorithms, 15th IEEE International On-Line Testing Symposium, pp.133-137, 2009.
Coping with silent and fail-stop errors at scale by combining replication and checkpointing, J. Parallel Distrib. Comput., vol.122, pp.209-225, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02082389
Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput., vol.69, pp.410-416, 2009.
Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy, ACM Trans. Parallel Comput., vol.1, issue.2, p.28, 2015.
Toward Exascale Resilience: 2014 update, Supercomputing Frontiers and Innovations, vol.1, issue.1, 2014.
Algorithm-based checkpoint-free fault tolerance for parallel matrix multiplications on volatile resources, Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, 2006.
Condition numbers of Gaussian random matrices, SIAM Journal on Matrix Analysis and Applications, vol.27, issue.3, pp.603-620, 2005.
Numerically stable real number codes based on random matrices, vol.3514, 2005.
A backward/forward recovery approach for the preconditioned conjugate gradient method, J. Computational Science, vol.17, pp.522-534, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01354682
Fault-tolerant high-performance matrix multiplication: Theory and practice, Proceedings of the International Conference on Dependable Systems and Networks, pp.47-56, 2001.
Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks, 2015.
A new approach to probabilistic rounding error analysis, SIAM Journal on Scientific Computing, vol.41, issue.5, pp.2815-2835, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02311269
Algorithm-based fault tolerance for matrix operations, IEEE Trans. on Comp. (Spec. Issue Reliable & Fault-Tolerant Comp.), vol.33, pp.518-528, 1984.
The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev., vol.6, issue.2, pp.200-209, 1962.
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013.
A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems, Software: Practice & Experience, vol.27, issue.9, pp.995-1012, 1997.
Algorithm based fault tolerance versus result-checking for matrix computations, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352), pp.4-11, 1999.
Fault-detection by result-checking for the eigenproblem, Dependable Computing - EDCC-3, pp.419-436, 1999.
Polynomial codes over certain finite fields, Journal of the Society for Industrial and Applied Mathematics, vol.8, issue.2, pp.300-304, 1960.
Algorithm-based fault location and recovery for matrix computations on multiprocessor systems, IEEE Transactions on Computers, vol.45, issue.11, 1996.
FLAME Working Note, vol.76, 2015.
Towards practical algorithm based fault tolerance in dense linear algebra, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp.31-42, 2016.