Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004. ,
DOI : 10.1109/TDSC.2004.2
Which verification for soft error detection? In HiPC, 2015. ,
Detecting silent data corruption through data dynamic monitoring for scientific applications, PPoPP. ACM, 2014. ,
Detecting and correcting data corruption in stencil applications through multivariate interpolation, FTS. IEEE, 2015. ,
Exploiting spatial smoothness in HPC applications to detect silent data corruption, HPCC. IEEE, 2015. ,
Assessing general-purpose algorithms to cope with fail-stop and silent errors, PMBS. ACM, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01066664
Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. ,
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01215857
Silent error detection in numerical time-stepping schemes, Int. Journal of High Performance Computing Applications, 2014. ,
Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015. ,
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, 2008. ,
DOI : 10.1145/1375527.1375552
Improving the trust in results of numerical simulations and scientific data analytics, 2015. ,
DOI : 10.2172/1179023
Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009. ,
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.7068
Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.29, issue.2, 2014. ,
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.7068
Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00668016
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015. ,
DOI : 10.1016/j.future.2015.04.003
URL : https://hal.archives-ouvertes.fr/hal-01199752
Application-level fault tolerance in the orbital thermal imaging spectrometer, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. ,
DOI : 10.1109/PRDC.2004.1276551
Programming Models and Development Software for a Space-Based Many-Core Processor, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology, pp.95-102, 2011. ,
DOI : 10.1109/SMC-IT.2011.29
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014. ,
DOI : 10.1109/IPDPS.2014.122
The International Exascale Software Project roadmap, The International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011. ,
DOI : 10.2172/471364
Evaluating the Impact of SDC on the GMRES Iterative Solver, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014. ,
DOI : 10.1109/IPDPS.2014.123
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012. ,
DOI : 10.1109/ICDCS.2012.56
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004. ,
DOI : 10.1109/TDSC.2004.15
The case for modular redundancy in large-scale high performance computing systems, PDCN. IASTED, 2009. ,
Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks / 720: Software Engineering, 2011. ,
DOI : 10.2316/P.2011.719-031
Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011. ,
DOI : 10.1145/2063384.2063443
Detection and correction of silent data corruption for large-scale high-performance computing, SC. ACM, 2012. ,
ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012. ,
DOI : 10.1016/j.procs.2012.04.018
Fault-tolerant iterative methods via selective reliability, 2011. ,
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009. ,
DOI : 10.1007/978-3-540-30218-6_19
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
Design, modeling, and evaluation of a scalable multi-level checkpointing system, SC. ACM, 2010. ,
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013. ,
DOI : 10.1145/2503210.2503266
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
DOI : 10.1109/16.278509
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), 2007. ,
DOI : 10.1109/MSST.2007.4367962
Supporting highly-decoupled thread-level redundancy for parallel programs, 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pp.393-404, 2008. ,
DOI : 10.1109/HPCA.2008.4658655
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013. ,
DOI : 10.1145/2530268.2530272
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007. ,
DOI : 10.1088/1742-6596/78/1/012022
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, 2012. ,
DOI : 10.1145/2304576.2304588
Addressing failures in exascale computing, The International Journal of High Performance Computing Applications, vol.28, issue.2, pp.129-173, 2014. ,
Does partial replication pay off? In FTXS, 2012. ,
Programmer-directed partial redundancy for resilient HPC, Proceedings of the 12th ACM International Conference on Computing Frontiers, CF '15, 2015. ,
DOI : 10.1145/1321211.1321241
Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010. ,
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
Thread-level redundancy fault tolerant CMP based on relaxed input replication, ICCIT. IEEE, 2011. ,
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009. ,
DOI : 10.1109/CLUSTR.2009.5289177
IBM experiments in soft fails in computer electronics (1978–1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996. ,
DOI : 10.1147/rd.401.0003