Checkpointing as a Service in Heterogeneous Cloud Environments

Abstract: A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.
Document type:
Conference paper
15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CC-GRID 2015), May 2015, Shenzhen, Guangdong, China. 2015, Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CC-GRID 2015). 〈http://cloud.siat.ac.cn/ccgrid2015〉

Cited literature [29 references]

https://hal.inria.fr/hal-01102094
Contributor: Christine Morin
Submitted on: Friday, January 15, 2016 - 16:28:40
Last modified on: Wednesday, May 16, 2018 - 11:23:31
Document(s) archived on: Saturday, April 16, 2016 - 10:59:02

File

ccgrid2015_submission_204-FINA...
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01102094, version 1

Citation

Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin. Checkpointing as a Service in Heterogeneous Cloud Environments. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CC-GRID 2015), May 2015, Shenzhen, Guangdong, China. 2015, Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CC-GRID 2015). 〈http://cloud.siat.ac.cn/ccgrid2015〉. 〈hal-01102094〉

Metrics

Record views

570

File downloads

137