L. Bautista-gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka, Transparent low-overhead checkpoint for GPU-accelerated clusters

A. S. Bland, R. A. Kendall, D. B. Kothe, J. H. Rogers, and G. M. Shipman, Jaguar: The World's Most Powerful Computer, 2009.

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, In PPAM LNCS, vol.6067, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22

URL : https://hal.archives-ouvertes.fr/hal-00788926

F. Cappello, H. Casanova, and Y. Robert, Checkpointing vs. Migration for Post-Petascale Supercomputers, 2010 39th International Conference on Parallel Processing, 2010.
DOI : 10.1109/ICPP.2010.26

URL : https://hal.archives-ouvertes.fr/inria-00437201

V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi et al., Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001.
DOI : 10.1147/rd.452.0311

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4
DOI : 10.1177/1094342009347714

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437

J. C. Ho, C. L. Wang, and F. C. Lau, Scalable group-based checkpoint/restart for large-scale message-passing systems, 2008 IEEE International Symposium on Parallel and Distributed Processing, pp.1-12, 2008.
DOI : 10.1109/IPDPS.2008.4536302

W. M. Jones, J. T. Daly, and N. Debardeleben, Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pp.276-279, 2010.
DOI : 10.1145/1851476.1851509

N. Kolettis and N. Fulton, Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995.

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. Cluster Computing and the Grid, IEEE International Symposium on, vol.0, pp.398-407, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00433523

P. L. Ecuyer and J. Malenfant, Computing optimal checkpointing strategies for rollback and recovery systems, IEEE Transactions on Computers, vol.37, issue.4, pp.491-496, 2002.
DOI : 10.1109/12.2197

Y. Ling, J. Mi, and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Transactions on computers, pp.699-708, 2001.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS, pp.1-9, 2008.

]. E. Meneses, Clustering Parallel Applications to Enhance Message Logging Protocols

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

L. Martin and . Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2005.

V. Sarkar and . Others, Exascale software study: Software challenges in extreme scale systems White paper available at: http://users. ece, 2009.

B. Schroeder and G. A. Gibson, A large-scale study of failures in highperformance computing systems, Proc. of DSN, pp.249-258, 2006.

A. N. Tantawi and M. Ruschitzka, Performance analysis of checkpointing strategies, ACM Transactions on Computer Systems, vol.2, issue.2, pp.123-144, 1984.
DOI : 10.1145/190.357398

S. Toueg and O. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

K. Venkatesh, Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115