A new flexible Checkpoint/Restart model

Mohamed Slim Bouguerra 1 Denis Trystram 1 Thierry Gautier 1 Jean-Marc Vincent 2
1 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
2 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract: The use of new-generation computing platforms such as computational grids or desktop grids introduces challenging new problems. In particular, because of the huge number of processors involved, security and fault tolerance are key issues that must be taken into account. Coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. Application-directed checkpointing for fault tolerance puts a heavy strain on the storage system and the communications. This results in large overheads on application execution times, which severely affect performance and scalability. This work presents a new model of the coordinated checkpoint/restart mechanism for several types of computing platforms. Its main feature is that it is independent of the failure law, which makes it very flexible. We show that such a model can be used to determine the optimal periodic checkpoint interval and to reduce the checkpoint overhead through a mathematical analysis of reliability. Moreover, unlike most existing checkpointing models, the proposed model can take a variable checkpoint cost into account. Finally, we report experiments based on simulations with random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
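To make the idea of a periodic checkpoint interval concrete, the following is a minimal simulation sketch. It is not the model of the report: it assumes a constant checkpoint cost, a fixed restart cost during which no failure occurs, and a fresh time-to-failure draw after each failure. All function names and parameter values (`simulate_run`, `mean_completion_time`, `work=100.0`, and so on) are hypothetical and chosen only for illustration. An exponential time to failure corresponds to a Poisson failure process; the Weibull case uses the standard library's `weibullvariate(scale, shape)`.

```python
import random


def simulate_run(work, interval, ckpt_cost, restart_cost, draw_ttf, rng):
    """Wall-clock time needed to finish `work` units with periodic checkpoints.

    Simplifying assumptions (not taken from the report): constant checkpoint
    cost, no failure during a restart, and a new time-to-failure draw after
    every failure.
    """
    elapsed = 0.0
    saved = 0.0            # work protected by the last completed checkpoint
    ttf = draw_ttf(rng)    # time remaining until the next failure
    while saved < work:
        last = (work - saved) <= interval
        segment = (work - saved) if last else interval
        # The final segment ends the job, so it needs no checkpoint.
        cost = segment if last else segment + ckpt_cost
        if ttf >= cost:
            elapsed += cost
            ttf -= cost
            saved = work if last else saved + interval
        else:
            # Failure: the segment in progress is lost, then we restart.
            elapsed += ttf + restart_cost
            ttf = draw_ttf(rng)
    return elapsed


def mean_completion_time(interval, draw_ttf, runs=2000, seed=0,
                         work=100.0, ckpt_cost=1.0, restart_cost=1.0):
    """Monte Carlo estimate of the mean completion time for one interval."""
    rng = random.Random(seed)
    total = sum(
        simulate_run(work, interval, ckpt_cost, restart_cost, draw_ttf, rng)
        for _ in range(runs)
    )
    return total / runs


def exponential_ttf(mean):
    """Exponential times to failure, i.e. a Poisson failure process."""
    return lambda rng: rng.expovariate(1.0 / mean)


def weibull_ttf(scale, shape):
    """Weibull times to failure with the given scale and shape."""
    return lambda rng: rng.weibullvariate(scale, shape)


if __name__ == "__main__":
    candidates = [2.0, 5.0, 10.0, 20.0, 50.0]
    for name, law in [("exponential", exponential_ttf(50.0)),
                      ("weibull", weibull_ttf(50.0, 0.7))]:
        best = min(candidates, key=lambda t: mean_completion_time(t, law))
        print(name, "best interval among candidates:", best)
```

Because the failure law enters only through the `draw_ttf` callable, the same loop can be reused with any distribution, which loosely mirrors the law-independence the abstract emphasizes; a variable checkpoint cost would require replacing the constant `ckpt_cost` with a function of the current state.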
Document type:
Report
[Research Report] RR-6751, INRIA. 2008

https://hal.inria.fr/inria-00348135
Contributor: Mohamed Slim Bouguerra
Submitted on: Wednesday, 17 December 2008 - 18:50:50
Last modified on: Saturday, 17 September 2016 - 01:38:19
Document(s) archived on: Thursday, 11 October 2012 - 14:00:08

File

RR-6751.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00348135, version 1

Citation

Mohamed Slim Bouguerra, Denis Trystram, Thierry Gautier, Jean-Marc Vincent. A new flexible Checkpoint/Restart model. [Research Report] RR-6751, INRIA. 2008. <inria-00348135>

Metrics

Record views: 901

Document downloads: 375