On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.11-20, 2013. ,
DOI : 10.1109/PRDC.2013.10
URL : https://hal.archives-ouvertes.fr/hal-00836871
Which Verification for Soft Error Detection?, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015. ,
DOI : 10.1109/HiPC.2015.26
URL : https://hal.archives-ouvertes.fr/hal-01252382
Assessing general-purpose algorithms to cope with fail-stop and silent errors, Proceedings of the 5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01358146
Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) ,
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01354886
Two-level checkpointing and partial verifications for linear task graphs, Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2015. ,
DOI : 10.1109/ipdpsw.2016.106
URL : https://hal.archives-ouvertes.fr/hal-01252400
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 2014. ,
DOI : 10.1177/1094342014532297
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008. ,
DOI : 10.1145/1375527.1375552
Assessing the Impact of Partial Verifications against Silent Data Corruptions, 2015 44th International Conference on Parallel Processing, 2015. ,
DOI : 10.1109/ICPP.2015.53
URL : https://hal.archives-ouvertes.fr/hal-01253493
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
DOI : 10.1145/214451.214456
Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of the 18th Symposium on Principles and Practice of Parallel Programming, pp.167-176, 2013. ,
Spectral Graph Theory, 1997. ,
DOI : 10.1090/cbms/092
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009. ,
DOI : 10.1177/1094342009347714
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012. ,
DOI : 10.1109/ICDCS.2012.56
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002. ,
DOI : 10.1145/568522.568525
The case for modular redundancy in large-scale high performance computing systems, Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009. ,
A backward/forward recovery approach for the preconditioned conjugate gradient method, Journal of Computational Science, vol.17 ,
DOI : 10.1016/j.jocs.2016.04.008
URL : https://hal.archives-ouvertes.fr/hal-01354682
Combining Algorithm-based Fault Tolerance and Checkpointing for Iterative Solvers. Short version appears in the proceedings of PDSEC, 2015. ,
Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011. ,
DOI : 10.1145/2063384.2063443
Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011. ,
Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.33, issue.6, pp.518-528, 1984. ,
Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012. ,
DOI : 10.1145/2189750.2150989
When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, pp.49-56, 2013. ,
DOI : 10.1145/2465813.2465821
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005. ,
DOI : 10.1017/CBO9780511813603
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), 2010. ,
DOI : 10.2172/984082
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013. ,
DOI : 10.1145/2503210.2503266
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
DOI : 10.1109/16.278509
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006. ,
DOI : 10.1109/TDSC.2006.22
Iterative methods for sparse linear systems, 2003. ,
DOI : 10.1137/1.9780898718003
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013. ,
DOI : 10.1145/2530268.2530272
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007. ,
DOI : 10.1088/1742-6596/78/1/012022
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012. ,
DOI : 10.1145/2304576.2304588
On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984. ,
DOI : 10.1137/0213039
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009. ,
DOI : 10.1109/CLUSTR.2009.5289177
Accelerated testing for cosmic soft-error rate, IBM Journal of Research and Development, vol.40, issue.1, pp.51-72, 1996. ,
DOI : 10.1147/rd.401.0051
Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998. ,
DOI : 10.1109/4.658626
IBM experiments in soft fails in computer electronics (1978-1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996. ,
DOI : 10.1147/rd.401.0003