Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

Sheng Di 1, 2; Yves Robert 3, 4; Frédéric Vivien 4, 3; Derrick Kondo 1; Cho-Li Wang 5; Franck Cappello 2, 6, 7, 8
1 MESCAL - Middleware efficiently scalable
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
3 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
6 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
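The abstract names Young's formula as its baseline without stating it. As a rough illustration only, the sketch below computes Young's classical first-order checkpoint interval, sqrt(2 * C * MTBF), and turns it into a checkpoint count for a task of known length. The function names, parameters, and sample values are hypothetical, and this is not the paper's optimized formula, which the abstract does not give.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF).

    checkpoint_cost: time to write one checkpoint (seconds, assumed constant).
    mtbf: mean time between failures (seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoint_count(task_length: float, checkpoint_cost: float, mtbf: float) -> int:
    """Number of checkpoints when the task is cut into equal intervals
    whose length is as close as possible to Young's interval."""
    interval = young_interval(checkpoint_cost, mtbf)
    # At least one interval; a checkpoint is taken between consecutive intervals.
    intervals = max(1, round(task_length / interval))
    return intervals - 1


if __name__ == "__main__":
    # Hypothetical values: 2 s per checkpoint, one failure every 100 s on average.
    print(young_interval(2.0, 100.0))
    print(checkpoint_count(100.0, 2.0, 100.0))
```

Because the interval grows with the square root of the checkpoint cost, cheaper checkpoints lead to more frequent checkpointing, which is the trade-off the abstract's contributions (1) and (2) aim to tune.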
Document type:
Conference paper
SC13 - Supercomputing - 2013, Nov 2013, Denver, United States. ACM, 2013, 〈10.1145/2503210.2503217〉

https://hal.inria.fr/hal-00847635
Contributor: Equipe Roma
Submitted on: Wednesday, 24 July 2013 - 09:46:45
Last modified on: Thursday, 20 July 2017 - 09:31:07
Document(s) archived on: Wednesday, 5 April 2017 - 16:21:13

File

adaptive-opt.pdf
Files produced by the author(s)

Citation

Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, et al.. Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism. SC13 - Supercomputing - 2013, Nov 2013, Denver, United States. ACM, 2013, 〈10.1145/2503210.2503217〉. 〈hal-00847635〉

Metrics

Record views: 514

File downloads: 356