Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

In this paper, we aim to optimize fault-tolerance techniques based on a checkpointing/restart mechanism in the context of cloud computing. Our contribution is three-fold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic, with no assumption on the failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing with regard to various costs, such as checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
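For orientation, the baseline named in the abstract, Young's formula, is the classical first-order approximation of the optimal checkpoint interval, sqrt(2 * C * M), where C is the cost of one checkpoint and M is the mean time between failures. The sketch below is only an illustration of that baseline. The function names and the sample values are assumptions made for the example and do not come from the paper, and the paper's own formula is not reproduced here because the abstract does not state it.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (C).
    mtbf_s: mean time between failures, in seconds (M).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(task_length_s: float, interval_s: float) -> int:
    """Checkpoints taken when a task is split into equal intervals.

    No checkpoint is counted at the very end of the task, since the
    task is already complete at that point (an assumption of this sketch).
    """
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    return max(0, math.ceil(task_length_s / interval_s) - 1)


if __name__ == "__main__":
    # Illustrative values only: a 2 s checkpoint cost, a 1 h MTBF,
    # and a 30 min task.
    interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
    count = checkpoint_count(task_length_s=1800.0, interval_s=interval)
    print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

With these sample values the interval comes out at 120 seconds, so a 30-minute task would take 14 checkpoints. Young's formula depends on the failure process only through M, whereas the abstract describes the paper's analysis as making no assumption on the failure probability distribution.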

Keywords

Cloud Computing; Checkpoint-Restart Mechanism; Optimal Checkpointing Interval; Google; BLCR

Domains

Parallel, distributed, and shared computing [cs.DC]; Performance and reliability [cs.PF]

Main file

adaptive-opt.pdf (798.56 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-00847635

Submitted on: Wednesday, July 24, 2013, 09:46:45

Last modified on: Thursday, April 4, 2024, 20:50:15

Long-term archiving on: Wednesday, April 5, 2017, 16:21:13

Dates and versions

hal-00847635, version 1 (24-07-2013)

Identifiers

HAL Id: hal-00847635, version 1
DOI: 10.1145/2503210.2503217

Cite

Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, et al. Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism. SC13 - Supercomputing - 2013, Nov 2013, Denver, United States. ⟨10.1145/2503210.2503217⟩. ⟨hal-00847635⟩


