Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995.

K. Taura and A. A. Chien, A heuristic algorithm for mapping communicating tasks on heterogeneous resources, Proceedings 9th Heterogeneous Computing Workshop (HCW 2000), pp.102-115, 2000.
DOI : 10.1109/HCW.2000.843736
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.699

Q. Wu and Y. Gu, Supporting Distributed Application Workflows in Heterogeneous Computing Environments, 2008 14th IEEE International Conference on Parallel and Distributed Systems, 2008.
DOI : 10.1109/ICPADS.2008.40

A. Duda, The effects of checkpointing on program execution time, Information Processing Letters, vol.16, issue.5, pp.221-229, 1983.
DOI : 10.1016/0020-0190(83)90093-5

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8437

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS, pp.1-9, 2008.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, PPAM, ser. LNCS, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

G. M. Amdahl, The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, pp.483-485, 1967.

M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2014.
DOI : 10.1177/1094342013505348
URL : https://hal.archives-ouvertes.fr/hal-00881463

M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2001.

P.-F. Dutot, G. Mounié, and D. Trystram, Scheduling Parallel Tasks: Approximation Algorithms, Handbook of Scheduling, ch.26, 2004.

M. Bouguerra, D. Trystram, and F. Wagner, Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010.
DOI : 10.1109/CCGRID.2010.71
URL : https://hal.archives-ouvertes.fr/inria-00433523

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

W. Jones, J. Daly, and N. DeBardeleben, Impact of suboptimal checkpoint intervals on application efficiency in computational clusters, HPDC'10, pp.276-279, 2010.

K. Venkatesh, Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010.

A. Tantawi and M. Ruschitzka, Performance analysis of checkpointing strategies, ACM Transactions on Computer Systems, vol.2, issue.2, pp.123-144, 1984.
DOI : 10.1145/190.357398

J. Dongarra, E. Jeannot, E. Saule, and Z. Shi, Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '07, pp.280-288, 2007.
DOI : 10.1145/1248377.1248423
URL : https://hal.archives-ouvertes.fr/hal-00155964

A. Dogan and F. Özgüner, Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.308-323, 2002.
DOI : 10.1109/71.993209

A. Girault, E. Saule, and D. Trystram, Reliability versus performance for critical applications, Journal of Parallel and Distributed Computing, vol.69, issue.3, pp.326-336, 2009.
DOI : 10.1016/j.jpdc.2008.11.002
URL : https://hal.archives-ouvertes.fr/hal-00753169

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443