Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Trans. Dependable Sec. Comput, vol.1, pp.11-33, 2004. ,
Identifying the right replication level to detect and correct silent errors at scale, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-02082907
Algorithmbased fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, pp.410-416, 2009. ,
Transparent Redundant Computing with MPI, 2010. ,
Improving the trust in results of numerical simulations and scientific data analytics, 2015. ,
Toward Exascale Resilience, Int. J. High Performance Computing Applications, vol.23, pp.374-388, 2009. ,
Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, p.1, 2014. ,
Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, pp.210-224, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00668016
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Comp. Syst, vol.51, pp.7-19, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01199752
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006. ,
Optimization of multi-level checkpoint model for large scale HPC applications, 2014. ,
The International Exascale Software Project Roadmap, Int. J. High Perform. Comput. Appl, vol.25, pp.3-60, 2011. ,
Combining partial redundancy and checkpointing for HPC, 2012. ,
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, pp.97-108, 2004. ,
The case for modular redundancy in large-scale highh performance computing systems, 2009. ,
Redundant execution of HPC applications with MR-MPI, 2011. ,
Evaluating the Viability of Process Replication Reliability for Exascale Systems, PSC'11, 2011. ,
Detection and correction of silent data corruption for large-scale high-performance computing, p.78, 2012. ,
ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012. ,
Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput, vol.33, pp.518-528, 1984. ,
, Fault-Tolerance Techniques for High-Performance Computing, 2015.
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009. ,
The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, pp.200-209, 1962. ,
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010. ,
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013. ,
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, pp.553-557, 1994. ,
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conf. Mass Storage Systems and Technologies, 2007. ,
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, p.1, 2007. ,
Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, p.1, 2007. ,
Fault Tolerant Preconditioned Conjugate Gradient for Sparse Linear System Solution, ICS, 2012. ,
Addressing Failures in Exascale Computing, Int. J. High Perform. Comput. Appl, vol.28, pp.129-173, 2014. ,
Does partial replication pay off, 2012. ,
Programmer-directed Partial Redundancy for Resilient HPC, Computing Frontiers, 2015. ,
Using Replication and Checkpointing for Reliable Task Management in Computational Grids, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, pp.530-531, 1974. ,
Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011. ,
Reliability-aware scalability models for high performance computing, Cluster Computing, 2009. ,
IBM Experiments in Soft Fails in Computer Electronics, IBM J. Res. Dev, vol.40, pp.3-18, 1996. ,