BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds - Inria - National Institute for Research in Digital Science and Technology
Journal article, Journal of Parallel and Distributed Computing, Year: 2013

BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds

Abstract

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
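
The abstract does not spell out how incremental snapshots and selective copy-on-write work, so the following is only a minimal sketch of one plausible reading, not the BlobCR implementation. It assumes the disk is split into fixed-size chunks, that a snapshot contains only the chunks modified since the previous snapshot, and that a chunk is copied only if it is overwritten while still waiting to be persisted. The class `ChunkedDisk`, its methods, and the in-memory `repository` are hypothetical names, and background persistence is modeled as explicit `flush_one()` steps rather than a real thread.

```python
class ChunkedDisk:
    """Toy virtual disk split into fixed-size chunks (all names are hypothetical)."""

    def __init__(self, num_chunks, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = [bytes(chunk_size) for _ in range(num_chunks)]
        self.dirty = set()    # chunks modified since the last snapshot
        self.pending = {}     # chunk index -> preserved copy (or None), awaiting flush
        self.repository = []  # persisted incremental snapshots: {chunk index: data}
        self.copies_made = 0  # how many chunks selective copy-on-write had to copy

    def write(self, index, data):
        if len(data) != self.chunk_size:
            raise ValueError("data must be exactly one chunk long")
        # Assumed selective copy-on-write rule: preserve the old content only
        # when the chunk belongs to a snapshot that has not been flushed yet.
        # bytes objects are immutable, so keeping a reference stands in for a copy.
        if index in self.pending and self.pending[index] is None:
            self.pending[index] = self.chunks[index]
            self.copies_made += 1
        self.chunks[index] = data
        self.dirty.add(index)

    def snapshot(self):
        """Freeze the current dirty set; the caller may resume writing at once."""
        if self.pending:
            raise RuntimeError("previous snapshot is still being flushed")
        self.pending = {i: None for i in sorted(self.dirty)}
        self.dirty.clear()
        self.repository.append({})
        return len(self.repository) - 1

    def flush_one(self):
        """Persist one pending chunk; returns False when nothing is left."""
        if not self.pending:
            return False
        index = next(iter(self.pending))
        preserved = self.pending.pop(index)
        data = preserved if preserved is not None else self.chunks[index]
        self.repository[-1][index] = data
        return True

    def restore(self, snapshot_id):
        """Rebuild the disk image by replaying incremental snapshots in order.

        Assumes the requested snapshot has already been fully flushed.
        """
        image = [bytes(self.chunk_size)] * len(self.chunks)
        for snap in self.repository[: snapshot_id + 1]:
            for index, data in snap.items():
                image[index] = data
        return image


disk = ChunkedDisk(num_chunks=4)
disk.write(0, b"aaaa")
disk.write(2, b"cccc")
first = disk.snapshot()
disk.write(2, b"CCCC")  # chunk 2 is still pending, so its old content is preserved
disk.write(3, b"dddd")  # chunk 3 is not part of the snapshot, so no copy is made
while disk.flush_one():
    pass
print(disk.copies_made, disk.restore(first))
```

In this toy model the writer never waits for the flush to finish, and the copy cost is paid only for chunks that are rewritten during the flush window, which is the intuition behind keeping checkpoint overhead low.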
Main file
paper.pdf (263.17 KB)
Origin: files produced by the author(s)

Dates and versions

hal-00857964 , version 1 (04-09-2013)

Cite

Bogdan Nicolae, Franck Cappello. BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds. Journal of Parallel and Distributed Computing, 2013, 73 (5), pp.698-711. ⟨10.1016/j.jpdc.2013.01.013⟩. ⟨hal-00857964⟩