A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016
Preventive migration vs. preventive checkpointing for extreme scale supercomputers, Parallel Processing Letters, vol.21, issue.2, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126
URL : https://hal.archives-ouvertes.fr/hal-00945068
Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.
DOI : 10.1145/2063384.2063443
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677
Improving cluster availability using workstation validation, SIGMETRICS Perf. Eval. Rev., vol.30, issue.1, 2002.
DOI : 10.1145/511399.511362
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437
A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.
DOI : 10.1145/2063384.2063444
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011.
DOI : 10.1109/DSNW.2011.5958823
A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.
DOI : 10.1109/DSNW.2010.5542627
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.107
Failure Prediction in IBM BlueGene/L Event Logs, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.583-588, 2007.
DOI : 10.1109/ICDM.2007.46
Predicting computer system failures using support vector machines, Proceedings of the First USENIX Conference on Analysis of System Logs, USENIX Association, 2008.
Exploring event correlation for failure prediction in coalitions of clusters, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07), pp.41:1-41:12, 2007.
DOI : 10.1145/1362622.1362678
Inovallée, 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier Cedex. Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105, 78153 Le Chesnay Cedex. ISSN 0249-6399.