A Flexible Checkpoint/Restart Model in Distributed Systems

Mohamed Slim Bouguerra; Thierry Gautier; Denis Trystram; Jean-Marc Vincent

Conference paper, Year: 2009

Mohamed Slim Bouguerra

Role: Corresponding author
PersonId : 856730

PrograMming and scheduling design fOr Applications in Interactive Simulation

Thierry Gautier

Role: Author

PrograMming and scheduling design fOr Applications in Interactive Simulation

Denis Trystram

Role: Author
PersonId : 5762
IdHAL : denis-trystram
ORCID : 0000-0002-2623-6922
IdRef : 029778301

Institut universitaire de France

PrograMming and scheduling design fOr Applications in Interactive Simulation

Jean-Marc Vincent

Role: Author
PersonId : 750922
IdHAL : jean-marc-vincent
ORCID : 0000-0003-3576-2024

Middleware efficiently scalable

Abstract

Large-scale applications running on new computing platforms with thousands of processors face reliability problems: the failure of a single processor causes the entire execution to fail. Most existing approaches to guaranteeing reliable execution are based on fault tolerance mechanisms, and coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. This work presents a new model of the coordinated checkpoint/restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost of saving a globally consistent state of the processes, and the number of computational resources. Through mathematical analysis of reliability, we apply this new model to compute the optimal interval between checkpoint dates in order to minimize the average completion time. Its main feature is that it is independent of the type of failure law, which makes it very flexible. We show that such a model may be used to reduce the checkpoint rate by up to 20% in some cases and the total overhead by up to a factor of 4 in some cases. Finally, we report experiments based on simulations for random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
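
To make the trade-off described in the abstract concrete, the following is a minimal simulation sketch, not the model from the paper. It assumes a simplified setting chosen here for illustration: a fixed checkpoint interval, a constant checkpoint cost, a constant restart cost, failures forming a renewal process, and no failures during restart. All function and parameter names (`simulate_run`, `mean_completion_time`, `exponential_ttf`, `weibull_ttf`) are hypothetical.

```python
import math
import random


def simulate_run(work, interval, ckpt_cost, restart_cost, draw_ttf, rng):
    """Simulate one execution and return its total wall-clock time.

    Assumptions (illustrative only): work is cut into segments of length
    `interval`; a checkpoint of cost `ckpt_cost` follows every segment except
    the last; a failure discards the current segment and costs `restart_cost`.
    """
    elapsed = 0.0
    saved = 0.0                       # work protected by the last checkpoint
    time_to_failure = draw_ttf(rng)   # time remaining until the next failure
    while saved < work:
        remaining = work - saved
        is_last = remaining <= interval
        segment = remaining if is_last else interval
        cost = segment + (0.0 if is_last else ckpt_cost)
        if time_to_failure >= cost:
            # Segment (and its checkpoint, if any) completes before the failure.
            elapsed += cost
            time_to_failure -= cost
            saved = work if is_last else saved + interval
        else:
            # Failure strikes first: lose the segment, restart, redraw the law.
            elapsed += time_to_failure + restart_cost
            time_to_failure = draw_ttf(rng)
    return elapsed


def mean_completion_time(work, interval, ckpt_cost, restart_cost, draw_ttf,
                         runs=2000, seed=0):
    """Monte Carlo estimate of the average completion time."""
    rng = random.Random(seed)
    total = sum(
        simulate_run(work, interval, ckpt_cost, restart_cost, draw_ttf, rng)
        for _ in range(runs)
    )
    return total / runs


def exponential_ttf(mtbf):
    """Inter-failure times of a Poisson process with the given mean."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull_ttf(scale, shape):
    """Weibull-distributed inter-failure times."""
    return lambda rng: rng.weibullvariate(scale, shape)


if __name__ == "__main__":
    # Sweep a few candidate intervals under both failure laws.
    for interval in (5.0, 10.0, 20.0, 40.0, 80.0):
        exp_mean = mean_completion_time(500.0, interval, 2.0, 5.0,
                                        exponential_ttf(100.0))
        wei_mean = mean_completion_time(500.0, interval, 2.0, 5.0,
                                        weibull_ttf(100.0, 0.7))
        print(f"interval={interval:5.1f}  exponential={exp_mean:8.1f}  "
              f"weibull={wei_mean:8.1f}")
```

Sweeping the interval in this way gives a rough empirical view of where the average completion time is smallest for each failure law. The numeric values in the sweep are arbitrary placeholders, not parameters taken from the paper.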

Subject areas

Parallel, distributed, and shared computing [cs.DC]

Contributor: Arnaud Legrand

https://inria.hal.science/hal-00788926

Submitted on: Friday, 15 February 2013, 13:46:31

Last modified on: Monday, 15 April 2024, 11:25:23

Dates and versions

hal-00788926 , version 1 (15-02-2013)

Identifiers

HAL Id : hal-00788926 , version 1

Cite

Mohamed Slim Bouguerra, Thierry Gautier, Denis Trystram, Jean-Marc Vincent. A Flexible Checkpoint/Restart Model in Distributed Systems. Proceedings of the 8th International IEEE Conference on Parallel Processing and Applied Mathematics (PPAM'09), 2009, Wroclaw, Poland. pp.206-215. ⟨hal-00788926⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA LIG LIG_SRCPR LIG_SRCPR_MOAIS INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM LIG_SIDCH
