Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

Sheng Di 1, 2; Yves Robert 3, 4; Frédéric Vivien 4, 3; Derrick Kondo 1; Cho-Li Wang 5; Franck Cappello 2, 6, 7, 8
1 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
3 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
6 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
Abstract : In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
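For context on the baseline named in the abstract, the following is a minimal sketch of Young's classical first-order approximation, which sets the checkpoint interval to sqrt(2 x checkpoint cost x mean time between failures). It illustrates the baseline only and is not the formula derived in the paper. The function names, the rounding rule for the checkpoint count, and the sample values are assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def young_checkpoint_count(task_length_s: float,
                           checkpoint_cost_s: float,
                           mtbf_s: float) -> int:
    """Number of checkpoints for a task of the given productive length.

    Assumption (not from the paper): the task is split into equal
    intervals, and no checkpoint is taken after the last interval.
    """
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    intervals = max(1, round(task_length_s / interval))
    return intervals - 1


if __name__ == "__main__":
    # Hypothetical values: 2 s per checkpoint, one failure per hour on
    # average, and a task needing 1200 s of productive work.
    print(young_interval(2.0, 3600.0))
    print(young_checkpoint_count(1200.0, 2.0, 3600.0))
```

Young's approximation relies on the mean time between failures alone, whereas the abstract states that the paper's analysis makes no assumption on the failure probability distribution.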

https://hal.inria.fr/hal-00847635
Contributor : Equipe Roma
Submitted on : Wednesday, July 24, 2013 - 9:46:45 AM
Last modification on : Thursday, August 1, 2019 - 2:12:06 PM
Long-term archiving on : Wednesday, April 5, 2017 - 4:21:13 PM

File

adaptive-opt.pdf
Files produced by the author(s)

Citation

Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, et al. Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism. SC13 - Supercomputing - 2013, Nov 2013, Denver, United States. ⟨10.1145/2503210.2503217⟩. ⟨hal-00847635⟩
