G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski et al., Design and modeling of a non-blocking checkpointing system, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
DOI : 10.1109/SC.2012.46

E. W. Fulp, G. A. Fink, and J. N. Haack, Predicting computer system failures using support vector machines, Proceedings of the First USENIX conference on Analysis of system logs, USENIX Association, 2008.

A. Gainaru, F. Cappello, and W. Kramer, Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.107

A. Gainaru, F. Cappello, W. Kramer, and M. Snir, Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
DOI : 10.1109/SC.2012.57

Y. Liang, Y. Zhang, H. Xiong, and R. K. Sahoo, Failure Prediction in IBM BlueGene/L Event Logs, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.583-588, 2007.
DOI : 10.1109/ICDM.2007.46

L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011.
DOI : 10.1109/DSNW.2011.5958823

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.
DOI : 10.1109/DSNW.2010.5542627

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

N. Kolettis and N. D. Fulton, Software rejuvenation: Analysis, module and applications, in: FTCS '95, IEEE CS, p.381, 1995.

V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi et al., Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001.
DOI : 10.1147/rd.452.0311

J. Hong, S. Kim, Y. Cho, H. Yeom, and T. Park, On the choice of checkpoint interval using memory usage profile and adaptive time series analysis, Proc. Pacific Rim Int. Symp. on Dependable Computing, 2001.

J. Wingstrom, Overcoming the difficulties created by the volatile nature of desktop grids through understanding, prediction and redundancy, 2009.

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

Y. Robert, F. Vivien, and D. Zaidouni, On the complexity of scheduling checkpoints for computational workflows, in: FTXS'2012, the Workshop on Fault-Tolerance for HPC at Extreme Scale, in conjunction with the 42nd Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks, 2012.

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, SIGMETRICS Perf. Eval. Rev., vol.30, issue.1.

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS'08, 2008.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems, Cluster Computing and the Grid, IEEE International Symposium, pp.398-407, 2010.

F. Cappello, H. Casanova, and Y. Robert, Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126
URL : https://hal.archives-ouvertes.fr/hal-00945068

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

Y. Li, Z. Lan, P. Gujrati, and X. Sun, Fault-aware runtime strategies for high-performance computing, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.4, pp.460-473, 2009.

S. M. Ross, Introduction to Probability Models, Tenth Edition, 2009.