Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995. ,
Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001. ,
DOI : 10.1147/rd.452.0311
A heuristic algorithm for mapping communicating tasks on heterogeneous resources, Proceedings 9th Heterogeneous Computing Workshop (HCW 2000) (Cat. No.PR00556), pp.102-115, 2000. ,
DOI : 10.1109/HCW.2000.843736
Supporting Distributed Application Workflows in Heterogeneous Computing Environments, 2008 14th IEEE International Conference on Parallel and Distributed Systems, 2008. ,
DOI : 10.1109/ICPADS.2008.40
The effects of checkpointing on program execution time, Information Processing Letters, vol.16, issue.5, pp.221-229, 1983. ,
DOI : 10.1016/0020-0190(83)90093-5
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002. ,
DOI : 10.1145/511399.511362
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437
A large-scale study of failures in highperformance computing systems, Proc. of DSN, pp.249-258, 2006. ,
An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS, pp.1-9, 2008. ,
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063444
A Flexible Checkpoint/Restart Model in Distributed Systems, PPAM, ser. LNCS, pp.206-215978, 2010. ,
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2012. ,
DOI : 10.1177/1094342013505348
URL : https://hal.archives-ouvertes.fr/hal-00668016
Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979. ,
Introduction to Algorithms, 2001. ,
Scheduling Parallel Tasks Approximation Algorithms, Handbook of Scheduling, p.26, 2004. ,
Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, p.2012 ,
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010. ,
DOI : 10.1109/CCGRID.2010.71
URL : https://hal.archives-ouvertes.fr/inria-00433523
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pp.276-279, 2010. ,
DOI : 10.1145/1851476.1851509
Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010. ,
Performance analysis of checkpointing strategies, ACM Transactions on Computer Systems, vol.2, issue.2, pp.123-144, 1984. ,
DOI : 10.1145/190.357398
Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures , SPAA '07, pp.280-288, 2007. ,
DOI : 10.1145/1248377.1248423
URL : https://hal.archives-ouvertes.fr/hal-00155964
Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.308-323, 2002. ,
DOI : 10.1109/71.993209
Reliability versus performance for critical applications, Journal of Parallel and Distributed Computing, vol.69, issue.3, pp.326-336, 2009. ,
DOI : 10.1016/j.jpdc.2008.11.002
URL : https://hal.archives-ouvertes.fr/hal-00753169
Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010. ,
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063443
Inovallée 655 avenue de l'Europe Montbonnot 38334 Saint Ismier Cedex Publisher Inria Domaine de Voluceau -Rocquencourt BP 105 -78153 Le Chesnay Cedex inria, pp.249-6399 ,