J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

F. Cappello, H. Casanova, and Y. Robert, Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126

URL : https://hal.archives-ouvertes.fr/hal-00945068

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, SIGMETRICS Perf. Eval. Rev., vol.30, issue.1, 2002.
DOI : 10.1145/511399.511362

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444

L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011.
DOI : 10.1109/DSNW.2011.5958823

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.
DOI : 10.1109/DSNW.2010.5542627

A. Gainaru, F. Cappello, and W. Kramer, Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.107

Y. Liang, Y. Zhang, H. Xiong, and R. K. Sahoo, Failure Prediction in IBM BlueGene/L Event Logs, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.583-588, 2007.
DOI : 10.1109/ICDM.2007.46

E. W. Fulp, G. A. Fink, and J. N. Haack, Predicting computer system failures using support vector machines, Proceedings of the First USENIX conference on Analysis of system logs. USENIX Association, 2008.

S. Fu and C. Xu, Exploring event correlation for failure prediction in coalitions of clusters, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC '07, pp.41:1-41:12, 2007.
DOI : 10.1145/1362622.1362678

Research Report N° 8023. Inria Research Centre Grenoble - Rhône-Alpes, Inovallée, 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier Cedex. Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex. ISSN 0249-6399.