Transparent low-overhead checkpoint for GPU-accelerated clusters ,
Jaguar: The World's Most Powerful Computer, 2009. ,
A Flexible Checkpoint/Restart Model in Distributed Systems, In PPAM LNCS, vol.6067, pp.206-215, 2010. ,
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926
Checkpointing vs. Migration for Post-Petascale Supercomputers, 2010 39th International Conference on Parallel Processing, 2010. ,
DOI : 10.1109/ICPP.2010.26
URL : https://hal.archives-ouvertes.fr/inria-00437201
Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001. ,
DOI : 10.1147/rd.452.0311
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4 ,
DOI : 10.1177/1094342009347714
Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002. ,
DOI : 10.1145/511399.511362
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437
Scalable group-based checkpoint/restart for large-scale message-passing systems, 2008 IEEE International Symposium on Parallel and Distributed Processing, pp.1-12, 2008. ,
DOI : 10.1109/IPDPS.2008.4536302
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pp.276-279, 2010. ,
DOI : 10.1145/1851476.1851509
Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995. ,
The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. Cluster Computing and the Grid, IEEE International Symposium on, vol.0, pp.398-407, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00433523
Computing optimal checkpointing strategies for rollback and recovery systems, IEEE Transactions on Computers, vol.37, issue.4, pp.491-496, 2002. ,
DOI : 10.1109/12.2197
A variational calculus approach to optimal checkpoint placement, IEEE Transactions on computers, pp.699-708, 2001. ,
An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS, pp.1-9, 2008. ,
Clustering Parallel Applications to Enhance Message Logging Protocols ,
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006. ,
DOI : 10.1109/TDSC.2006.22
Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2005. ,
Exascale software study: Software challenges in extreme scale systems White paper available at: http://users. ece, 2009. ,
A large-scale study of failures in highperformance computing systems, Proc. of DSN, pp.249-258, 2006. ,
Performance analysis of checkpointing strategies, ACM Transactions on Computer Systems, vol.2, issue.2, pp.123-144, 1984. ,
DOI : 10.1145/190.357398
On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984. ,
DOI : 10.1137/0213039
Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010. ,
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115