L. Bautista-Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka, Transparent low-overhead checkpoint for GPU-accelerated clusters

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-33, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, In PPAM LNCS, vol.6067, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22

URL : https://hal.archives-ouvertes.fr/hal-00788926

V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi et al., Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001.
DOI : 10.1147/rd.452.0311

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.

K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield et al., Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, HPDC'10, pp.276-279, 2010.

N. Kolettis and N. D. Fulton, Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995.

D. Kondo, A. Chien, and H. Casanova, Scheduling Task Parallel Applications for Rapid Turnaround on Enterprise Desktop Grids, Journal of Grid Computing, vol.290, issue.5-6, pp.379-405, 2007.
DOI : 10.1007/s10723-007-9063-y

E. Meneses, Clustering Parallel Applications to Enhance Message Logging Protocols

V. Sarkar, Exascale software study: Software challenges in extreme scale systems White paper available at: http://users. ece, 2009.

B. Schroeder and G. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.

K. Venkatesh, Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010.

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140

URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177
