A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988. ,
DOI : 10.1109/12.9736
Assessing general-purpose algorithms to cope with fail-stop and silent errors, Workshop on Performance Modeling, Benchmarking and Simulation (PMBS), 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01358146
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 1312. ,
DOI : 10.1177/1094342014532297
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Fault-tolerant linear solvers via selective reliability, preprint, 2012. ,
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008. ,
DOI : 10.1145/1375527.1375552
Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, Proc. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.167-176, 2013. ,
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, vol.38, issue.1, pp.1-25, 2011. ,
DOI : 10.1145/2049662.2049663
Algorithm-based fault tolerance for dense matrix factorizations, PPoPP, pp.225-234, 2012. ,
Sparse matrix test problems, ACM Transactions on Mathematical Software, vol.15, issue.1, pp.1-14, 1989. ,
DOI : 10.1145/62038.62043
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012. ,
DOI : 10.1109/ICDCS.2012.56
Quantifying the impact of single bit flips on floating point arithmetic, preprint, 2013. ,
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM/IEEE SC Int. Conf., SC '12, 2012. ,
Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011. ,
Accuracy and Stability of Numerical Algorithms, 2002. ,
DOI : 10.1137/1.9780898718027
Functions of Matrices: Theory and Computation, 2008. ,
DOI : 10.1137/1.9780898717778
Fault-Tolerant Iterative Methods via Selective Reliability, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), p.9, 2011. ,
Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers, vol.33, issue.6, pp.518-528, 1984. ,
Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design ,
When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013. ,
DOI : 10.1145/2465813.2465821
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005. ,
DOI : 10.1017/CBO9780511813603
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010. ,
DOI : 10.1109/SC.2010.18
Iterative Methods for Sparse Linear Systems, 2003. ,
DOI : 10.1137/1.9780898718003
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013. ,
DOI : 10.1145/2530268.2530272
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, 2012. ,
DOI : 10.1145/2304576.2304588
Algorithmic approaches to low overhead fault detection for sparse linear algebra, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp.1-12, 2012. ,
DOI : 10.1109/DSN.2012.6263938
Laplacian matrix. From MathWorld--A Wolfram Web Resource, 2014. ,
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115