Conference papers

Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

Sheng Di 1, 2 Yves Robert 3, 4 Frédéric Vivien 4, 3 Derrick Kondo 1 Cho-Li Wang 5 Franck Cappello 2, 6, 7, 8 
1 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
3 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
6 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
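For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of Young's classical first-order approximation, which sets the checkpoint interval to sqrt(2 × checkpoint cost × mean time between failures). It does not reproduce the formula derived in the paper. The function and parameter names (`young_interval`, `checkpoint_count`, `checkpoint_cost_s`, `mtbf_s`, `task_length_s`) are illustrative assumptions, as is the convention that no checkpoint is taken at the very end of a task.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed name).
    mtbf_s: mean time between failures, in seconds (assumed name).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(task_length_s: float, interval_s: float) -> int:
    """Number of checkpoints for a task split into equal-length intervals.

    Assumption for this sketch: no checkpoint is taken after the final
    interval, since the task has completed by then.
    """
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    return max(0, math.ceil(task_length_s / interval_s) - 1)
```

With a checkpoint cost of 2 seconds and a mean time between failures of 100 seconds, `young_interval(2.0, 100.0)` gives an interval of 20 seconds, so a 100-second task would take 4 checkpoints under the convention above.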
Contributor : Equipe Roma
Submitted on : Wednesday, July 24, 2013 - 9:46:45 AM
Last modification on : Thursday, September 29, 2022 - 2:58:07 PM
Long-term archiving on : Wednesday, April 5, 2017 - 4:21:13 PM


Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, et al. Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism. SC13 - Supercomputing - 2013, Nov 2013, Denver, United States. ⟨10.1145/2503210.2503217⟩. ⟨hal-00847635⟩


