A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988. ,
DOI : 10.1109/12.9736
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014. ,
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313
Assessing generalpurpose algorithms to cope with fail-stop and silent errors, Workshop on Performance Modeling, Benchmarking and Simulation (PMBS), 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01358146
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 1312. ,
DOI : 10.1177/1094342014532297
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.1-11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Faulttolerant linear solvers via selective reliability, p.2012 ,
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing , ICS '08, pp.155-164, 2008. ,
DOI : 10.1145/1375527.1375552
Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, Proc. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '13, pp.167-176, 2013. ,
A higher order estimate of the optimum checkpoint interval for restart dumps The University of Florida Sparse Matrix Collection, FGCS ACM Trans. Math. Softw, vol.22, issue.38 1 1, pp.303-3121, 2004. ,
Algorithmbased fault tolerance for dense matrix factorizations, pp.225-234, 2012. ,
Quantifying the impact of single bit flips on floating point arithmetic, 2013. ,
DOI : 10.2172/1089338
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012. ,
DOI : 10.1109/ICDCS.2012.56
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.2542
Combining Algorithm-based Fault Tolerance and Checkpointing for Iterative Solvers Available: https, 2015. ,
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM/IEEE SC Int. Conf., ser. SC '12, 2012. ,
Fault-tolerant iterative methods via selective reliability, Sandia National Laboratories, 2011. ,
Accuracy and Stability of Numerical Algorithms, 2002. ,
DOI : 10.1137/1.9780898718027
Fault-tolerant iterative methods via selective reliability, Sandia Corporation, Tech. Rep, 2011. ,
Algorithm-Based Fault Tolerance for Matrix Operations, Computers, IEEE Transactions, vol.33, issue.6, pp.518-528, 1984. ,
Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012. ,
DOI : 10.1145/2189750.2150989
Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication, Parallel Processing and Applied Mathematics, pp.174-184, 2014. ,
DOI : 10.1007/978-3-642-55195-6_16
URL : https://hal.archives-ouvertes.fr/hal-00821523
When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13 ,
DOI : 10.1145/2465813.2465821
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005. ,
DOI : 10.1017/CBO9780511813603
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM, pp.1-11, 2010. ,
Iterative Methods for Sparse Linear Systems, 2003. ,
DOI : 10.1137/1.9780898718003
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013. ,
DOI : 10.1145/2530268.2530272
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, 2012. ,
DOI : 10.1145/2304576.2304588
Algorithmic approaches to low overhead fault detection for sparse linear algebra, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp.1-12, 2012. ,
DOI : 10.1109/DSN.2012.6263938
Quantifying the impact of single bit flips on floating point arithmetic, Oak Ridge National Laboratory, Tech. Rep, 2013. ,
Laplacian matrix, 2014. ,
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115