Combining Backward and Forward Recovery to Cope with Silent Errors in Iterative Solvers, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pp.980-989, 2015.
DOI : 10.1109/IPDPSW.2015.22
URL : https://hal.archives-ouvertes.fr/hal-01159679
Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp.1-11, 2010.
DOI : 10.2172/984082
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities, International Journal of High Performance Computing Applications, vol.23, issue.3, pp.212-226, 2009.
DOI : 10.1177/1094342009106189
Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767
Toward Exascale Resilience, open access, p.14
DOI : 10.1177/1094342009347767
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, p.012022, 2007.
DOI : 10.1088/1742-6596/78/1/012022
Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, Proc. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.167-176, 2013.
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016
Assessing general-purpose algorithms to cope with fail-stop and silent errors, Performance Modeling, Benchmarking and Simulation (PMBS), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01358146
Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers, vol.C-33, issue.6, pp.518-528, 1984.
Algorithm-based fault tolerance for dense matrix factorizations, pp.225-234, 2012.
DOI : 10.1145/2370036.2145845
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.498.9892
Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.422-436, 2015.
DOI : 10.1177/1094342015578487
Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition, IEEE Transactions on Parallel and Distributed Systems, vol.26, issue.5, pp.1323-1335, 2015.
DOI : 10.1109/tpds.2014.2320502
Soft error resilient QR factorization for hybrid system with GPGPU, Scalable Algorithms for Large-Scale Systems Workshop (ScalA2011), pp.457-464, 2011.
DOI : 10.1016/j.jocs.2013.01.004
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588
A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM Journal on Scientific Computing, vol.19, issue.3, pp.968-994, 1998.
DOI : 10.1137/S1064827595294691
Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication, Parallel Processing and Applied Mathematics (PPAM2014), pp.174-184, 2014.
DOI : 10.1007/978-3-642-55195-6_16
URL : https://hal.archives-ouvertes.fr/hal-00821523
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the ACM/IEEE SC Int. Conf., SC '12, 2012.
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56
When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, 2013.
DOI : 10.1145/2465813.2465821
Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989
Fault-tolerant iterative methods via selective reliability, Tech. rep., Sandia Corporation, 2011.
Fault-Tolerant Iterative Methods via Selective Reliability, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), p.9, 2011.
Fault-tolerant linear solvers via selective reliability, preprint, 2012.
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 2015.
DOI : 10.1177/1094342014532297
Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552
A linear algebraic model of algorithm-based fault tolerance, IEEE Transactions on Computers, vol.37, issue.12, pp.1599-1604, 1988.
DOI : 10.1109/12.9736
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002
Algorithmic approaches to low overhead fault detection for sparse linear algebra, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp.1-12, 2012.
DOI : 10.1109/DSN.2012.6263938
Iterative Methods for Sparse Linear Systems, 2003.
DOI : 10.1137/1.9780898718003
Spectral Graph Theory, 1997.
DOI : 10.1090/cbms/092
Eigenvalues and Condition Numbers of Random Matrices, SIAM Journal on Matrix Analysis and Applications, vol.9, issue.4, pp.543-560, 1988.
DOI : 10.1137/0609045
Condition Numbers of Gaussian Random Matrices, SIAM Journal on Matrix Analysis and Applications, vol.27, issue.3, pp.603-620, 2005.
DOI : 10.1137/040616413
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603
The University of Florida Sparse Matrix Collection, ACM Transactions on Mathematical Software, vol.38, issue.1, pp.1-1, 2011.
DOI : 10.1145/2049662.2049663
Stabilized and block approximate inverse preconditioners for problems in solid and structural mechanics, Computer Methods in Applied Mechanics and Engineering, vol.190, issue.49-50, pp.6533-6554, 2001.
DOI : 10.1016/S0045-7825(01)00235-3
Accuracy and Stability of Numerical Algorithms, 2002.
DOI : 10.1137/1.9780898718027
Combining Algorithm-based Fault Tolerance and Checkpointing for Iterative Solvers, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01111707
Functions of Matrices: Theory and Computation, 2008.
DOI : 10.1137/1.9780898717778
Quantifying the impact of single bit flips on floating point arithmetic, preprint, 2013.
DOI : 10.2172/1089338