A new flexible Checkpoint/Restart model

Mohamed Slim Bouguerra 1 Denis Trystram 1 Thierry Gautier 1 Jean-Marc Vincent 2
1 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
2 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : The use of new-generation computing platforms such as computational grids and desktop grids introduces challenging new problems. In particular, because of the huge number of processors involved, security and fault tolerance are key issues that must be taken into account. Coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. Application-directed checkpointing puts a heavy strain on the storage system and the communications, which results in large overheads on application execution times that severely impact performance and scalability. This work presents a new model of a coordinated checkpoint/restart mechanism for several types of computing platforms. Its main feature is that it is independent of the failure law, which makes it very flexible. We show that such a model can be used to determine the optimal periodic checkpoint interval and to reduce the checkpoint overhead through a mathematical analysis of reliability. Moreover, unlike most existing checkpointing models, the proposed model can take a variable checkpoint cost into account. Finally, we report experiments based on simulations with random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
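
The trade-off behind an optimal periodic checkpoint interval can be illustrated with a small Monte Carlo sketch. This is not the model from the report: it is a simplified illustration under stated assumptions (a fixed checkpoint cost, no failures during restart, and a fresh failure clock after each failure). All names (`simulate_makespan`, `mean_makespan`, `exponential_ttf`, `weibull_ttf`) and all parameter values are hypothetical.

```python
import random


def simulate_makespan(work, interval, ckpt_cost, restart_cost, draw_ttf, rng):
    """Simulate one run and return the wall-clock time needed to finish `work`.

    Assumptions (illustrative only): checkpoints are taken every `interval`
    units of work, each costs `ckpt_cost`, a failure loses the segment in
    progress and costs `restart_cost`, and no failure occurs during restart.
    """
    elapsed = 0.0
    done = 0.0  # work already protected by a checkpoint
    time_to_failure = draw_ttf(rng)
    while done < work:
        segment = min(interval, work - done)
        # The final segment completes the job, so it needs no checkpoint.
        need = segment + (ckpt_cost if done + segment < work else 0.0)
        if time_to_failure >= need:
            elapsed += need
            time_to_failure -= need
            done += segment
        else:
            # Failure: the unsaved segment is lost; roll back and restart.
            elapsed += time_to_failure + restart_cost
            time_to_failure = draw_ttf(rng)
    return elapsed


def mean_makespan(work, interval, ckpt_cost, restart_cost, draw_ttf,
                  runs=2000, seed=0):
    """Average the simulated makespan over `runs` independent runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_makespan(work, interval, ckpt_cost, restart_cost, draw_ttf, rng)
        for _ in range(runs)
    )
    return total / runs


def exponential_ttf(mtbf):
    """Exponential inter-failure times (a Poisson failure process)."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull_ttf(scale, shape):
    """Weibull-distributed inter-failure times."""
    return lambda rng: rng.weibullvariate(scale, shape)


if __name__ == "__main__":
    # Hypothetical parameters: 1000 units of work, checkpoint cost 1,
    # restart cost 2, mean time between failures around 200.
    for tau in (5, 10, 20, 40, 80):
        poisson = mean_makespan(1000, tau, 1, 2, exponential_ttf(200), runs=200)
        weibull = mean_makespan(1000, tau, 1, 2, weibull_ttf(200, 0.7), runs=200)
        print(tau, round(poisson, 1), round(weibull, 1))
```

Sweeping `interval` this way shows the two competing costs: short intervals pay for many checkpoints, while long intervals lose more work per failure. Because the failure law is passed in as a function, the same loop runs unchanged for either distribution.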

https://hal.inria.fr/inria-00348135
Contributor : Mohamed Slim Bouguerra
Submitted on : Wednesday, December 17, 2008 - 6:50:50 PM
Last modification on : Wednesday, March 13, 2019 - 3:02:06 PM
Long-term archiving on : Thursday, October 11, 2012 - 2:00:08 PM

File

RR-6751.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00348135, version 1

Citation

Mohamed Slim Bouguerra, Denis Trystram, Thierry Gautier, Jean-Marc Vincent. A new flexible Checkpoint/Restart model. [Research Report] RR-6751, INRIA. 2008. ⟨inria-00348135⟩
