L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-00721216

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.403-421, 2015.
DOI : 10.1177/1094342014532297

URL : http://arxiv.org/abs/1312.2674

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
DOI : 10.1016/j.future.2015.04.003

URL : https://hal.archives-ouvertes.fr/hal-01199752

N. Chen and S. Ren, Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications, Proceedings of the 2009 ACM symposium on Applied Computing, SAC '09, pp.1015-1020, 2009.
DOI : 10.1145/1529282.1529506

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

S. Di, L. Bautista-Gomez, and F. Cappello, Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.907-918, 2014.
DOI : 10.1109/SC.2014.79

S. Di, M. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp.1181-1190, 2014.
DOI : 10.1109/IPDPS.2014.122

S. Di and F. Cappello, Fast error-bounded lossy HPC data compression with SZ, IEEE 30th International Parallel and Distributed Processing Symposium, 2016.

A. Feinberg, An 83,000-processor supercomputer can only match 1% of your brain, 2013.
URL : http://gizmodo.com/an-83-000-processor-supercomputer-only-matched-one

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of Supercomputing, SC '12, pp.1-78, 2012.

L. Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka, Low-overhead diskless checkpoint for hybrid computing systems, 2010 International Conference on High Performance Computing, pp.1-10, 2010.
DOI : 10.1109/HIPC.2010.5713163

L. A. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka, Distributed Diskless Checkpoint for Large Scale Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.63-72, 2010.
DOI : 10.1109/CCGRID.2010.40

D. Hakkarinen and Z. Chen, Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013.
DOI : 10.1109/TC.2012.17

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010.
DOI : 10.1109/ICPP.2010.80

A. Kulkarni, A. Manzanares, L. Ionkov, M. Lang, and A. Lumsdaine, The design and implementation of a multi-level content-addressable checkpoint file system, 2012 19th International Conference on High Performance Computing, pp.1-10, 2012.
DOI : 10.1109/HiPC.2012.6507514

H. Li, L. Pang, and Z. Wang, Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads, PLoS ONE, vol.9, issue.8, p.e104591, 2014.
DOI : 10.1371/journal.pone.0104591

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IEEE International Symposium on Parallel and Distributed Processing, 2008.

A. Moody, G. Bronevetsky, K. Mohror, and B. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.
DOI : 10.1109/SC.2010.18

N. H. Vaidya, A case for two-level distributed recovery schemes, Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '95/PERFORMANCE '95, pp.64-73, 1995.

J. Walters and V. Chaudhary, Replication-Based Fault Tolerance for MPI Applications, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.7, pp.997-1010, 2009.
DOI : 10.1109/TPDS.2008.172

J. P. Walters and V. Chaudhary, A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications, Proceedings of the 14th International Conference on High Performance Computing (HiPC'07), pp.257-268, 2007.
DOI : 10.1007/978-3-540-77220-0_26

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012.
DOI : 10.1109/DSNW.2012.6264677