Detecting silent data corruption through data dynamic monitoring for scientific applications, SIGPLAN Notices, vol.49, issue.8, pp.381-382, 2014.
Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proc. FTS'15, 2015.
FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216
Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01354886
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, 2014.
DOI : 10.1177/1094342014532297
Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015.
DOI : 10.1145/2749246.2749253
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, pp.2772-2791, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00696154
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, 2008.
DOI : 10.1145/1375527.1375552
Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, 2009.
DOI : 10.1177/1094342009347767
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456
Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, PPoPP, 2013.
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.122
Performance and reliability trade-offs for the double checkpointing algorithm, International Journal of Networking and Computing, vol.4, issue.1, pp.23-41, 2014.
DOI : 10.15803/ijnc.4.1_23
URL : https://hal.archives-ouvertes.fr/hal-01091928
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012.
DOI : 10.1109/ICDCS.2012.56
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012.
Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, 2010.
DOI : 10.1109/ICPP.2010.80
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200
Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proc. SC'10, pp.1-11, 2010.
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509
Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588
Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, issue.6, pp.198-202, 1998.
DOI : 10.1049/ip-sen:19982440
A case for two-level distributed recovery schemes, ACM SIGMETRICS Performance Evaluation Review, vol.23, issue.1, pp.64-73, 1995.
DOI : 10.1145/223586.223596
Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures, Euro-Par'15: Parallel Processing Workshops. Springer LNCS 9523, 2015.
DOI : 10.1007/978-3-319-27308-2_54
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Proc. CLUSTER'04, pp.93-103, 2004.
Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart, IEEE Transactions on Computers, vol.64, issue.5, pp.1402-1415, 2015.
DOI : 10.1109/TC.2014.2317182
Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626