Basic concepts and taxonomy of dependable and secure computing, IEEE Trans. Dependable Sec. Comput, vol.1, issue.1, pp.11-33, 2004. ,
Which verification for soft error detection? In HiPC, 2015. ,
Detecting silent data corruption through data dynamic monitoring for scientific applications, PPoPP, 2014. ,
Detecting and correcting data corruption in stencil applications through multivariate interpolation, FTS. IEEE, 2015. ,
Exploiting spatial smoothness in HPC applications to detect silent data corruption, 2015. ,
Identifying the right replication level to detect and correct silent errors at scale, Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop. ACM, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-02082907
Towards optimal multi-level checkpointing, IEEE Transactions on Computers, vol.66, issue.7, pp.1212-1226, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-02082416
Assessing general-purpose algorithms to cope with fail-stop and silent errors, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01066664
Optimal resilience patterns to cope with fail-stop and silent errors, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01215857
Silent error detection in numerical time-stepping schemes, Int. J. High Performance Computing Applications, 2014. ,
Lightweight silent data corruption detection based on runtime data analysis for HPC applications, 2015. ,
Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, issue.4, pp.410-416, 2009. ,
Soft error vulnerability of iterative linear algebra methods, ICS. ACM, 2008. ,
Improving the trust in results of numerical simulations and scientific data analytics, 2015. ,
Toward Exascale Resilience, Int. J. High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009. ,
Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014. ,
Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00881463
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Comp. Syst, vol.51, pp.7-19, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01199752
RAID: High-performance, reliable secondary storage, ACM Comput. Surv, vol.26, issue.2, pp.145-185, 1994. ,
Application-level fault tolerance in the orbital thermal imaging spectrometer, 2004. ,
Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011. ,
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006. ,
Optimization of multi-level checkpoint model for large scale HPC applications, 2014. ,
The international exascale software project roadmap, Int. J. High Perform. Comput. Appl, vol.25, issue.1, pp.3-60, 2011. ,
Evaluating the impact of SDC on the GMRES iterative solver, IPDPS. IEEE, 2014. ,
Combining partial redundancy and checkpointing for HPC, ICDCS, 2012. ,
Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004. ,
The case for modular redundancy in large-scale high performance computing systems, 2009. ,
Redundant execution of HPC applications with MR-MPI, PDCN. IASTED, 2011. ,
Evaluating the viability of process replication reliability for exascale systems, SC'11, 2011. ,
Detection and correction of silent data corruption for large-scale high-performance computing, 2012. ,
ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012. ,
Fault-tolerant iterative methods via selective reliability. Research report SAND2011-3915 C, Sandia Nat. Lab, 2011. ,
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015. ,
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009. ,
Toward I/O-efficient protection against silent data corruptions in RAID arrays, 30th Symposium on Mass Storage Systems and Technologies (MSST), pp.1-12, 2014. ,
The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962. ,
Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010. ,
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013. ,
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
Modeling the impact of checkpoints on next-generation systems, 24th IEEE Conf. Mass Storage Systems and Technologies, 2007. ,
Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008. ,
Self-stabilizing iterative solvers, ScalA '13, 2013. ,
Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007. ,
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, ICS, 2012. ,
Addressing failures in exascale computing, Int. J. High Perform. Comput. Appl, vol.28, issue.2, pp.129-173, 2014. ,
Does partial replication pay off? In FTXS, 2012. ,
Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015. ,
Using replication and checkpointing for reliable task management in computational grids, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011. ,
Reliability-aware scalability models for high performance computing, Cluster Computing, 2009. ,
IBM experiments in soft fails in computer electronics, IBM J. Res. Dev, vol.40, issue.1, pp.3-18, 1996. ,