J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

A. Gainaru, F. Cappello, W. Kramer, and M. Snir, Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, p.12, 2012.
DOI : 10.1109/SC.2012.57

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, 2013.
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313

L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), 2011.
DOI : 10.1109/DSNW.2011.5958823

Y. Liang, Y. Zhang, H. Xiong, and R. K. Sahoo, Failure Prediction in IBM BlueGene/L Event Logs, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.583-588, 2007.
DOI : 10.1109/ICDM.2007.46

N. Kolettis and N. D. Fulton, Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995.

J. Hong, S. Kim, Y. Cho, H. Yeom, and T. Park, On the choice of checkpoint interval using memory usage profile and adaptive time series analysis, Proc. Pacific Rim Int. Symp. on Dependable Computing, 2001.

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.
DOI : 10.1109/DSNW.2010.5542627

F. Cappello, H. Casanova, and Y. Robert, Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, pp.111-132, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00945068

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, p.11, 2011.
DOI : 10.1145/2063384.2063443

G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, SIGMETRICS Performance Evaluation Review, vol.30, issue.1, 2002.

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, In: Proc. of DSN, pp.249-258, 2006.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, p.11, 2011.
DOI : 10.1145/2063384.2063444

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

M. S. Bouguerra, T. Gautier, D. Trystram, and J. M. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, In: PPAM. LNCS, vol.6067, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926

A. Gainaru, F. Cappello, and W. Kramer, Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.107

E. W. Fulp, G. A. Fink, and J. N. Haack, Predicting computer system failures using support vector machines, Proceedings of the First USENIX conference on Analysis of system logs, USENIX Association, 2008.

Y. Li, Z. Lan, P. Gujrati, and X. Sun, Fault-aware runtime strategies for high-performance computing, IEEE TPDS, vol.20, issue.4, pp.460-473, 2009.