Scheduling computational workflows on failure-prone platforms, 17th Workshop on Advances in Parallel and Distributed Computational Models (APDCM'15), 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01251939
Analysis of the tradeoffs between energy and run time for multilevel checkpointing, Proc. PMBS'14, 2014. ,
Which verification for soft error detection?, Proc. HiPC'15, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01252382
Coping with recall and precision of soft error detectors, Journal of Parallel and Distributed Computing, vol.98, pp.8-24, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01246639
Detecting silent data corruption through data dynamic monitoring for scientific applications, SIGPLAN Notices, vol.49, pp.381-382, 2014. ,
Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proc.1st Int. Workshop on Fault Tolerant Systems (FTS), 2015. ,
FTI: High performance fault tolerance interface for hybrid systems, Proc. SC'11, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-01298430
Towards optimal multi-level checkpointing, IEEE Transactions on Computers, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01339788
Assessing general-purpose algorithms to cope with fail-stop and silent errors, ACM Trans. Parallel Computing, vol.3, issue.2, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01066664
Optimal resilience patterns to cope with failstop and silent errors, Proc. IPDPS'16, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01215857
Two-level checkpointing and partial verifications for linear task graphs, PDSEC'2016, the 17th Workshop on Parallel and Distributed Scientific and Engineering Computing, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01252400
Efficient checkpoint/verification patterns, Int. J. High Performance Computing Applications, 2015. ,
URL : https://hal.archives-ouvertes.fr/ensl-01252342
Silent error detection in numerical time-stepping schemes, Int. J. High Performance Computing Applications, 2014. ,
Lightweight silent data corruption detection based on runtime data analysis for HPC applications, Proc. HPDC, 2015. ,
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00696154
Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, issue.4, pp.410-416, 2009. ,
Soft error vulnerability of iterative linear algebra methods, Proceedings of the International Conference on Supercomputing (ICS), pp.155-164, 2008. ,
Toward Exascale Resilience, Int. Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009. ,
Toward exascale resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014. ,
Assessing the impact of partial verifications against silent data corruptions, Proc. ICPP, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01253493
Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proc. PPoPP, pp.167-176, 2013. ,
Introduction to Algorithms, 2001. ,
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006. ,
Optimization of multi-level checkpoint model for large scale HPC applications, Proc. IPDPS'14, 2014. ,
Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel & Distributed Systems, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01263879
Combining partial redundancy and checkpointing for HPC, Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), pp.615-626, 2012. ,
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Survey, vol.34, pp.375-408, 2002. ,
Evaluating the Viability of Process Replication Reliability for Exascale Systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, vol.44, p.12, 2011. ,
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012. ,
, Stochastic Processes: Theory for Applications, 2014.
Multilevel diskless checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013. ,
Fault-tolerant iterative methods via selective reliability, 2011. ,
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design, SIGARCH Comput. Archit. News, vol.40, issue.1, pp.111-122, 2012. ,
When is multi-version checkpointing needed?, Proc. 3rd Workshop on Fault-tolerance for HPC at extreme scale (FTXS), pp.49-56, 2013. ,
The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962. ,
Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010. ,
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013. ,
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
, Diskless checkpointing. IEEE Trans. Parallel Dist. Systems, vol.9, issue.10, pp.972-986, 1998.
A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Trans. Parallel Dist. Syst, vol.12, issue.4, pp.346-362, 2001. ,
Self-stabilizing iterative solvers, Proc. ScalA '13, 2013. ,
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proc. ICS, pp.69-78, 2012. ,
Using two-level stable storage for efficient checkpointing, IEE Proceedings -Software, vol.145, issue.6, pp.198-202, 1998. ,
On the optimum checkpoint selection problem, SIAM J. Comput, vol.13, issue.3, pp.630-649, 1984. ,
A case for two-level distributed recovery schemes. SIGMETRICS Perform, Eval. Rev, vol.23, issue.1, pp.64-73, 1995. ,
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998. ,
IBM experiments in soft fails in computer electronics, IBM J. Res. Dev, vol.40, issue.1, pp.3-18, 1996. ,