G. Aupy, A. Benoit, H. Casanova, and Y. Robert, Scheduling computational workflows on failure-prone platforms, 17th Workshop on Advances in Parallel and Distributed Computational Models (APDCM'15), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01251939

P. Balaprakash, L. A. Gomez, M. Bouguerra, S. M. Wild, F. Cappello et al., Analysis of the tradeoffs between energy and run time for multilevel checkpointing, Proc. PMBS'14, 2014.

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Which verification for soft error detection?, Proc. HiPC'15, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01252382

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Coping with recall and precision of soft error detectors, Journal of Parallel and Distributed Computing, vol.98, pp.8-24, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01246639

L. Bautista-Gomez and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, SIGPLAN Notices, vol.49, pp.381-382, 2014.

L. Bautista-Gomez and F. Cappello, Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proc. 1st Int. Workshop on Fault Tolerant Systems (FTS), 2015.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, Proc. SC'11, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01298430

A. Benoit, A. Cavelan, V. Le Fèvre, Y. Robert, and H. Sun, Towards optimal multi-level checkpointing, IEEE Transactions on Computers, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01339788

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, ACM Trans. Parallel Computing, vol.3, issue.2, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal resilience patterns to cope with fail-stop and silent errors, Proc. IPDPS'16, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01215857

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Two-level checkpointing and partial verifications for linear task graphs, PDSEC'2016, the 17th Workshop on Parallel and Distributed Scientific and Engineering Computing, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01252400

A. Benoit, Y. Robert, and S. K. Raina, Efficient checkpoint/verification patterns, Int. J. High Performance Computing Applications, 2015.
URL : https://hal.archives-ouvertes.fr/ensl-01252342

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, Int. J. High Performance Computing Applications, 2014.

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight silent data corruption detection based on runtime data analysis for HPC applications, Proc. HPDC, 2015.

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00696154

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput., vol.69, issue.4, pp.410-416, 2009.

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the International Conference on Supercomputing (ICS), pp.155-164, 2008.

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, Int. Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward exascale resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Assessing the impact of partial verifications against silent data corruptions, Proc. ICPP, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01253493

K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proc. PPoPP, pp.167-176, 2013.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2001.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst., vol.22, issue.3, pp.303-312, 2006.

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multi-level checkpoint model for large scale HPC applications, Proc. IPDPS'14, 2014.

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel & Distributed Systems, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01263879

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), pp.615-626, 2012.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, pp.375-408, 2002.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the Viability of Process Replication Reliability for Exascale Systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, vol.44, p.12, 2011.

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012.

R. G. Gallager, Stochastic Processes: Theory for Applications, 2014.

D. Hakkarinen and Z. Chen, Multilevel diskless checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013.

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design, SIGARCH Comput. Archit. News, vol.40, issue.1, pp.111-122, 2012.

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proc. 3rd Workshop on Fault-tolerance for HPC at extreme scale (FTXS), pp.49-56, 2013.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev., vol.6, issue.2, pp.200-209, 1962.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013.

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994.

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Trans. Parallel Dist. Systems, vol.9, issue.10, pp.972-986, 1998.

F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Trans. Parallel Dist. Syst., vol.12, issue.4, pp.346-362, 2001.

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proc. ScalA '13, 2013.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proc. ICS, pp.69-78, 2012.

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, issue.6, pp.198-202, 1998.

S. Toueg and Ö. Babaoglu, On the optimum checkpoint selection problem, SIAM J. Comput., vol.13, issue.3, pp.630-649, 1984.

N. H. Vaidya, A case for two-level distributed recovery schemes, SIGMETRICS Perform. Eval. Rev., vol.23, issue.1, pp.64-73, 1995.

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics, IBM J. Res. Dev., vol.40, issue.1, pp.3-18, 1996.