G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien et al., On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.11-20, 2013.
DOI : 10.1109/PRDC.2013.10
URL : https://hal.archives-ouvertes.fr/hal-00836871

L. Bautista-Gomez and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, Proceedings of PPoPP'14, pp.381-382, 2014.

L. Bautista-Gomez and F. Cappello, Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proceedings of FTS'15, 2015.

L. Bautista-Gomez and F. Cappello, Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption, Proceedings of HPCC'15, 2015.

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, Proceedings of PMBS, held as part of SC'14, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01358146

A. Benoit, Y. Robert, and S. K. Raina, Efficient checkpoint/verification patterns, The International Journal of High Performance Computing Applications, vol.40, issue.1, 2015.
DOI : 10.1177/1094342015594531
URL : https://hal.archives-ouvertes.fr/ensl-01252342

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, 2014.
DOI : 10.1177/1094342014532297

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015.
DOI : 10.1145/2749246.2749253

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

G. Bronevetsky and B. R. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Assessing the impact of partial verifications against silent data corruptions, Proceedings of the 44th Annual International Conference on Parallel Processing (ICPP), 2015.

K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, pp.63-75, 1985.

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of PPoPP'13, pp.167-176, 2013.

R. Cohen and L. Katzir, The Generalized Maximum Coverage Problem, Information Processing Letters, vol.108, issue.1, pp.15-22, 2008.
DOI : 10.1016/j.ipl.2008.03.017

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

M. Dow, Explicit inverses of Toeplitz and associated matrices, ANZIAM Journal, vol.44, pp.185-215, 2003.
DOI : 10.21914/anziamj.v44i0.493

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012.
DOI : 10.1109/ICDCS.2012.56
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.2542

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of SC'12, p.78, 2012.

M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979.

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems, 2004.
DOI : 10.1007/978-3-540-24777-7

G. Lu, Z. Zheng, and A. A. Chien, When is multiversion checkpointing needed, Proceedings of FTXS'13, pp.49-56, 2013.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of SC'10, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. J. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics (1978-1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996.
DOI : 10.1147/rd.401.0003

Research Report N° 8741, Inria Research Centre Grenoble - Rhône-Alpes, Inovallée, 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier Cedex. Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex. ISSN 0249-6399.