The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714
Exascale software study: Software challenges in extreme scale systems, white paper available at: http://users.ece.gatech, 2009.
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1.
DOI : 10.1088/1742-6596/78/1/012022
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
DOI : 10.1109/DSNW.2012.6264677
Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pp.276-279, 2010.
DOI : 10.1145/1851476.1851509
Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010.
A Flexible Checkpoint/Restart Model in Distributed Systems, LNCS, vol.6067, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926
Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362
A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.
An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS, pp.1-9, 2008.
Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys, vol.31, issue.1.
DOI : 10.1145/311531.311532
Scheduling Task Parallel Applications for Rapid Turnaround on Enterprise Desktop Grids, Journal of Grid Computing, vol.290, issue.5-6, pp.379-405, 2007.
DOI : 10.1007/s10723-007-9063-y
Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177
The case for modular redundancy in large-scale high performance computing systems, Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.
Software rejuvenation: Analysis, module and applications, in: FTCS '95, IEEE CS, p.381, 1995.
Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001.
DOI : 10.1147/rd.452.0311
Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67
On Ramanujan's Q-function, Journal of Computational and Applied Mathematics, vol.58, issue.1, pp.103-116, 1995.
DOI : 10.1016/0377-0427(93)E0258-N
See applications run and throughput jump: The case for redundant computing in HPC, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.29-34, 2010.
DOI : 10.1109/DSNW.2010.5542625
The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems, Cluster Computing and the Grid, IEEE International Symposium, pp.398-407, 2010.
Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2011.
DOI : 10.1177/1094342013505348
URL : https://hal.archives-ouvertes.fr/hal-00881463