BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds

Bogdan Nicolae 1, 2, *, Franck Cappello 2, 3, 4, 5
* Corresponding author
4 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, suspend-resume support, and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that can take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It supports both application-level and process-level checkpointing, as well as rolling back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
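To make the idea of incremental disk snapshots concrete, the following is a minimal sketch under stated assumptions, not BlobCR's implementation. The class `ChunkedDisk`, its methods, and the 4-byte chunk size are all hypothetical names and values chosen for illustration. The sketch is synchronous and tracks dirty chunks only; it does not model the asynchronous background persistence or the selective copy-on-write technique mentioned in the abstract.

```python
class ChunkedDisk:
    """Toy virtual disk split into fixed-size chunks (hypothetical, for illustration)."""

    def __init__(self, num_chunks, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = [bytes(chunk_size) for _ in range(num_chunks)]
        self.dirty = set()    # indices of chunks modified since the last snapshot
        self.snapshots = []   # each snapshot is a delta: {chunk_index: chunk_bytes}

    def write(self, index, data):
        """Overwrite one whole chunk and remember that it changed."""
        if len(data) != self.chunk_size:
            raise ValueError("data must be exactly one chunk long")
        self.chunks[index] = bytes(data)
        self.dirty.add(index)

    def snapshot(self):
        """Store only the chunks changed since the previous snapshot; return its id."""
        delta = {i: self.chunks[i] for i in sorted(self.dirty)}
        self.snapshots.append(delta)
        self.dirty.clear()
        return len(self.snapshots) - 1

    def restore(self, snapshot_id):
        """Roll back by replaying deltas 0..snapshot_id over a blank disk.

        Snapshots taken after snapshot_id are discarded, so that later
        deltas stay consistent with the restored state.
        """
        chunks = [bytes(self.chunk_size) for _ in self.chunks]
        for delta in self.snapshots[: snapshot_id + 1]:
            for i, data in delta.items():
                chunks[i] = data
        self.chunks = chunks
        self.dirty.clear()
        del self.snapshots[snapshot_id + 1:]


disk = ChunkedDisk(num_chunks=4)
disk.write(0, b"aaaa")
first = disk.snapshot()    # delta holds chunk 0 only
disk.write(1, b"bbbb")
second = disk.snapshot()   # delta holds chunk 1 only, not the whole disk
disk.write(0, b"zzzz")     # uncommitted change, lost on rollback
disk.restore(first)
```

In this sketch each snapshot costs space proportional to the data changed since the previous one, which is the property that makes frequent checkpoints of a whole disk affordable.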
Document type:
Journal article
Journal of Parallel and Distributed Computing, Elsevier, 2013, 73 (5), pp.698-711. 〈10.1016/j.jpdc.2013.01.013〉

Cited literature: 42 references

https://hal.inria.fr/hal-00857964
Contributor: Bogdan Nicolae
Submitted on: Wednesday, 4 September 2013 - 13:08:05
Last modified on: Thursday, 18 October 2018 - 18:30:05
Document(s) archived on: Thursday, 5 December 2013 - 04:17:04

File

paper.pdf
Files produced by the author(s)

Citation

Bogdan Nicolae, Franck Cappello. BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds. Journal of Parallel and Distributed Computing, Elsevier, 2013, 73 (5), pp.698-711. 〈10.1016/j.jpdc.2013.01.013〉. 〈hal-00857964〉

Metrics

Record views: 705

File downloads: 605