BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots

Bogdan Nicolae 1, 2, *, Franck Cappello 2, 1
* Corresponding author
2 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and high resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to this challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce a framework that combines checkpoint-restart protocols at guest level with virtual machine (VM) disk-image multi-snapshotting and multi-deployment at host level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both when customized checkpointing is available at application level and when it needs to be handled at process level.
Document type:
Conference paper
SC'11: The 24th International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2011, Seattle, United States. pp.34:1-34:12, 2011, 〈10.1145/2063384.2063429〉

Cited literature [26 references]

https://hal.inria.fr/inria-00601865
Contributor: Bogdan Nicolae
Submitted on: Tuesday, August 16, 2011 - 20:57:59
Last modified on: Thursday, January 11, 2018 - 06:22:14
Document(s) archived on: Monday, November 12, 2012 - 15:25:16

File

paper.pdf
Files produced by the author(s)

Citation

Bogdan Nicolae, Franck Cappello. BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots. SC'11: The 24th International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2011, Seattle, United States. pp.34:1-34:12, 2011, 〈10.1145/2063384.2063429〉. 〈inria-00601865〉

Metrics

Record views

524

File downloads

534