On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.11-20, 2013. ,
DOI : 10.1109/PRDC.2013.10
URL : https://hal.archives-ouvertes.fr/hal-00836871
Detecting silent data corruption through data dynamic monitoring for scientific applications, Proceedings of PPoPP'14, pp.381-382, 2014. ,
Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proceedings of FTS'15, 2015. ,
Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption, Proceedings of HPCC'15, 2015. ,
Assessing general-purpose algorithms to cope with fail-stop and silent errors, Proceedings of PMBS, held as part of SC'14, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01358146
Efficient checkpoint/verification patterns, The International Journal of High Performance Computing Applications, vol.40, issue.1, 2015. ,
DOI : 10.1177/1094342015594531
URL : https://hal.archives-ouvertes.fr/ensl-01252342
Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 2014. ,
DOI : 10.1177/1094342014532297
Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015. ,
DOI : 10.1145/2749246.2749253
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008. ,
DOI : 10.1145/1375527.1375552
Assessing the impact of partial verifications against silent data corruptions, Proceedings of the 44th Annual International Conference on Parallel Processing (ICPP), 2015. ,
K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, pp.63-75, 1985. ,
Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of PPoPP'13, pp.167-176, 2013. ,
The Generalized Maximum Coverage Problem, Information Processing Letters, vol.108, issue.1, pp.15-22, 2008. ,
DOI : 10.1016/j.ipl.2008.03.017
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009. ,
DOI : 10.1177/1094342009347714
Explicit inverses of Toeplitz and associated matrices, ANZIAM Journal, vol.44, pp.185-215, 2003. ,
DOI : 10.21914/anziamj.v44i0.493
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012. ,
DOI : 10.1109/ICDCS.2012.56
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.2542
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002. ,
DOI : 10.1145/568522.568525
Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of SC'12, p.78, 2012. ,
Computers and Intractability: A Guide to the Theory of NP-Completeness, 1979. ,
Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011. ,
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
Knapsack Problems, 2004. ,
DOI : 10.1007/978-3-540-24777-7
When is multiversion checkpointing needed, Proceedings of FTXS'13, pp.49-56, 2013. ,
The use of triple-modular redundancy to improve computer reliability ,
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of SC'10, 2010. ,
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013. ,
DOI : 10.1145/2503210.2503266
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
DOI : 10.1109/16.278509
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013. ,
DOI : 10.1145/2530268.2530272
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012. ,
DOI : 10.1145/2304576.2304588
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
IBM experiments in soft fails in computer electronics (1978-1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996. ,
DOI : 10.1147/rd.401.0003