G. Aupy, A. Benoit, H. Casanova, and Y. Robert, Scheduling Computational Workflows on Failure-Prone Platforms, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015.
DOI : 10.1109/IPDPSW.2015.33

URL : https://hal.archives-ouvertes.fr/hal-01075100

P. Balaprakash, L. A. Gomez, M. Bouguerra, S. M. Wild, F. Cappello et al., Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing, Proc. PMBS'14, 2014.
DOI : 10.1007/978-3-319-17248-4_13

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Which Verification for Soft Error Detection?, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015.
DOI : 10.1109/HiPC.2015.26

URL : https://hal.archives-ouvertes.fr/hal-01252382

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert et al., Coping with recall and precision of soft error detectors, Journal of Parallel and Distributed Computing, vol.98, pp.8-24, 2016.
DOI : 10.1016/j.jpdc.2016.07.007

URL : https://hal.archives-ouvertes.fr/hal-01246639

L. Bautista-Gomez and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, SIGPLAN Notices, vol.49, issue.8, pp.381-382, 2014.

L. Bautista-Gomez and F. Cappello, Detecting and correcting data corruption in stencil applications through multivariate interpolation, Proc. 1st Int. Workshop on Fault Tolerant Systems (FTS), 2015.
DOI : 10.1109/cluster.2015.108

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-00721216

A. Benoit, A. Cavelan, V. Le Fèvre, Y. Robert, and H. Sun, Towards Optimal Multi-Level Checkpointing, IEEE Transactions on Computers, 2017.
DOI : 10.1109/TC.2016.2643660

URL : https://hal.archives-ouvertes.fr/hal-01339788

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, ACM Trans. Parallel Computing, vol.3, issue.2, 2016.
DOI : 10.1145/2897189

URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal resilience patterns to cope with fail-stop and silent errors, Proc. IPDPS'16, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01354886

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Two-level checkpointing and partial verifications for linear task graphs, PDSEC'2016, the 17th Workshop on Parallel and Distributed Scientific and Engineering Computing, 2016.
DOI : 10.1109/ipdpsw.2016.106

URL : https://hal.archives-ouvertes.fr/hal-01252400

A. Benoit, Y. Robert, and S. K. Raina, Efficient checkpoint/verification patterns, International Journal of High Performance Computing Applications, 2015.
DOI : 10.1177/1094342015594531

URL : https://hal.archives-ouvertes.fr/ensl-01252342

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, 2014.
DOI : 10.1177/1094342014532297

URL : http://arxiv.org/abs/1312.2674

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015.
DOI : 10.1145/2749246.2749253

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
DOI : 10.1002/cpe.3173

URL : https://hal.archives-ouvertes.fr/hal-00696154

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

URL : http://www.osti.gov/scitech/servlets/purl/923619

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 Update, Supercomputing Frontiers and Innovations, vol.1, issue.1, 2014.

A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Assessing the Impact of Partial Verifications against Silent Data Corruptions, 2015 44th International Conference on Parallel Processing, 2015.
DOI : 10.1109/ICPP.2015.53

URL : https://hal.archives-ouvertes.fr/hal-01253493

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.7694

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proc. PPoPP, pp.167-176, 2013.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2001.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.122

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016.
DOI : 10.1109/TPDS.2016.2546248

URL : https://hal.archives-ouvertes.fr/hal-01263879

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626, 2012.
DOI : 10.1109/ICDCS.2012.56

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.2542

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012.

R. G. Gallager, Stochastic Processes: Theory for Applications.

D. Hakkarinen and Z. Chen, Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013.
DOI : 10.1109/TC.2012.17

M. Heroux and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, Research report SAND2011-3915 C, Sandia National Laboratories, 2011.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012.
DOI : 10.1145/2189750.2150989

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, pp.49-56, 2013.
DOI : 10.1145/2465813.2465821

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic checkpoint/restart for soft and hard error protection, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. J. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices, vol.41, issue.4, pp.553-557, 1994.
DOI : 10.1109/16.278509

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.4662

F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.4, pp.346-362, 2001.
DOI : 10.1109/71.920586

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, issue.6, pp.198-202, 1998.
DOI : 10.1049/ip-sen:19982440

URL : http://hdl.handle.net/10316/12927

S. Toueg and Ö. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

N. H. Vaidya, A case for two-level distributed recovery schemes, ACM SIGMETRICS Performance Evaluation Review, vol.23, issue.1, pp.64-73, 1995.
DOI : 10.1145/223586.223596

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.3540

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics (1978-1994), IBM Journal of Research and Development, vol.40, issue.1, pp.3-18, 1996.
DOI : 10.1147/rd.401.0003