G. Zheng, L. Shi, and L. V. Kale, FTC-Charm++: an inmemory checkpoint-based fault tolerant runtime for Charm++ and MPI, Proc. 2004 IEEE Int. Conf. Cluster Computing, 2004.

X. Ni, E. Meneses, and L. V. Kalé, Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, 2012.
DOI : 10.1109/CLUSTER.2012.82

J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, MPICH-V: a multiprotocol fault tolerant MPI, IJHPCA, vol.20, issue.3, pp.319-333, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00688637

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

Y. Ling, J. Mi, and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Trans. Computers, pp.699-708, 2001.

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distributionfree checkpoint placement algorithms based on min-max principle, IEEE TDSC, pp.130-140, 2006.

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, PPAM, ser. LNCS, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22

URL : https://hal.archives-ouvertes.fr/hal-00788926

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

S. Toueg and O. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

M. Bouguerra, D. Trystram, and F. Wagner, Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57

URL : https://hal.archives-ouvertes.fr/hal-00788101

J. S. Plank and M. G. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, p.1590, 2001.
DOI : 10.1006/jpdc.2001.1757

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC fault-folerant environment: an analytical approach, ICPP' 2010, 2010.
DOI : 10.1109/icpp.2010.80

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, pp.1-9, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

F. Cappello, H. Casanova, and Y. Robert, PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011.
DOI : 10.1142/S0129626411000126

URL : https://hal.archives-ouvertes.fr/hal-00945068