
A new flexible Checkpoint/Restart model

Mohamed Slim Bouguerra 1 Denis Trystram 1 Thierry Gautier 1 Jean-Marc Vincent 2
1 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
2 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract: The use of new-generation computing platforms such as computational grids and desktop grids introduces challenging new problems. In particular, because of the very large number of processors involved, security and fault tolerance are key issues that must be taken into account. Coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. Application-directed checkpointing, however, puts considerable strain on the storage system and on communications. This results in large overheads on application execution times, which severely impact performance and scalability. This work presents a new model of a coordinated checkpoint/restart mechanism for several types of computing platforms. Its main feature is that it is independent of the failure law, which makes it very flexible. We show that such a model may be used to determine the optimal periodic checkpoint interval and to reduce the checkpoint overhead through a mathematical analysis of reliability. Moreover, unlike most existing checkpointing models, the proposed model is able to take a variable checkpoint cost into account. Finally, we report experiments based on simulations with random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
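To make the notion of an optimal periodic checkpoint interval concrete, here is a minimal simulation sketch in Python. It is not the model of the report: it is a plain Monte Carlo estimate under simplifying assumptions chosen for illustration (constant checkpoint cost, constant restart cost, no failures during restart, and a failure clock that is redrawn after every failure). All function names and parameter values are hypothetical.

```python
import math
import random


def simulate_run(work, interval, ckpt_cost, restart_cost, next_failure, rng):
    """Wall-clock time needed to complete `work` units of computation
    when a checkpoint is taken after every `interval` units.

    Assumptions (illustrative only): constant checkpoint and restart
    costs, no failure during a restart, and a fresh time-to-failure is
    drawn after each failure.
    """
    clock = 0.0
    done = 0.0                      # work already saved by a checkpoint
    fail_at = next_failure(rng)     # absolute time of the next failure
    while done < work:
        segment = min(interval, work - done)
        # The last segment ends the job, so it needs no checkpoint.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else ckpt_cost)
        if clock + cost <= fail_at:
            clock += cost
            done += segment
        else:
            # Failure: the segment in progress is lost, then we restart.
            clock = fail_at + restart_cost
            fail_at = clock + next_failure(rng)
    return clock


def exponential(mtbf):
    """Times to failure of a Poisson process with mean `mtbf`."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull(scale, shape):
    """Weibull-distributed times to failure."""
    return lambda rng: rng.weibullvariate(scale, shape)


def mean_completion(work, interval, ckpt_cost, restart_cost,
                    next_failure, runs=200, seed=0):
    """Average completion time over `runs` independent simulations."""
    rng = random.Random(seed)
    total = sum(
        simulate_run(work, interval, ckpt_cost, restart_cost, next_failure, rng)
        for _ in range(runs)
    )
    return total / runs


def best_interval(candidates, work, ckpt_cost, restart_cost, next_failure):
    """Candidate interval with the smallest estimated completion time."""
    return min(
        candidates,
        key=lambda t: mean_completion(work, t, ckpt_cost, restart_cost,
                                      next_failure),
    )


if __name__ == "__main__":
    candidates = [5.0, 10.0, 20.0, 40.0, 80.0]
    for name, law in [("Poisson", exponential(100.0)),
                      ("Weibull", weibull(100.0, 0.7))]:
        t = best_interval(candidates, 1000.0, 2.0, 5.0, law)
        print(name, "best interval among candidates:", t)
```

Short intervals waste time on checkpoints, while long intervals lose more work at each failure; the sketch simply scans a few candidate intervals and keeps the one with the lowest estimated completion time for each failure law.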

Cited literature: 13 references
Contributor: Mohamed Slim Bouguerra
Submitted on: Wednesday, December 17, 2008 - 6:50:50 PM
Last modification on: Thursday, December 9, 2021 - 9:08:03 AM
Long-term archiving on: Thursday, October 11, 2012 - 2:00:08 PM


HAL Id: inria-00348135, version 1


Mohamed Slim Bouguerra, Denis Trystram, Thierry Gautier, Jean-Marc Vincent. A new flexible Checkpoint/Restart model. [Research Report] RR-6751, INRIA. 2008. ⟨inria-00348135⟩


