Complexity Analysis of Checkpoint Scheduling with Variable Costs

Mohamed Slim Bouguerra 1 Denis Trystram 2, 3 Frédéric Wagner 2, 3
1 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
3 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : The parallel computing platforms available today are increasingly larger and thus, more and more subject to failures. Consequently it is necessary to develop efficient strategies providing safe and reliable completion for HPC parallel applications. Checkpointing is one of the most popular and efficient technique for developing fault-tolerant applications on such context. However, checkpoint operations are costly in terms of time, computation and network communication. This will certainly affect the global performance of the application. In this work, we propose a performance model that expresses formally the checkpoint scheduling problem. This model exhibits the tradeoff between the impact of the checkpoints operations and the lost computation due to failures. Based on this model we study the computational complexity of the problem of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis which explicits in depth the relations between the probabilistic failure model, the checkpoint cost and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
Type de document :
Article dans une revue
IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2012, 99, 〈10.1109/TC.2012.57〉
Liste complète des métadonnées

https://hal.inria.fr/hal-00788101
Contributeur : Arnaud Legrand <>
Soumis le : mercredi 13 février 2013 - 17:05:47
Dernière modification le : lundi 5 octobre 2015 - 16:57:32

Identifiants

Collections

Citation

Mohamed Slim Bouguerra, Denis Trystram, Frédéric Wagner. Complexity Analysis of Checkpoint Scheduling with Variable Costs. IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2012, 99, 〈10.1109/TC.2012.57〉. 〈hal-00788101〉

Partager

Métriques

Consultations de la notice

126