Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2013

Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

Abstract

In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
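For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of Young's classical first-order approximation, which sets the checkpoint interval to sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint and MTBF is the mean time between failures. It is not the paper's new formula, which this page does not state. The function names, the rule of placing checkpoints only at interior interval boundaries, and the numeric values are assumptions made purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(task_length_s: float, interval_s: float) -> int:
    """Number of checkpoints for a task of the given productive length.

    Assumption (illustrative): checkpoints are taken only at interior
    interval boundaries, so no checkpoint is written when the task ends.
    """
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    return max(0, math.ceil(task_length_s / interval_s) - 1)


if __name__ == "__main__":
    # Hypothetical values: 2 s per checkpoint, one failure per hour on average.
    interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
    count = checkpoint_count(task_length_s=1000.0, interval_s=interval)
    print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

Young's formula relies on a single MTBF value, whereas the abstract describes the paper's analysis as making no assumption on the failure probability distribution.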
Main file
adaptive-opt.pdf (798.56 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00847635, version 1 (24-07-2013)

Identifiers

Cite

Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, et al. Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism. SC13 - Supercomputing - 2013, Nov 2013, Denver, United States. ⟨10.1145/2503210.2503217⟩. ⟨hal-00847635⟩
796 views
563 downloads
