Conference paper, Year: 2012

Resilience for Collaborative Applications on Clouds

Abstract

Because e-Science applications are data-intensive and require long execution runs, it is important that they feature fault-tolerance mechanisms. Cloud and grid computing infrastructures often support system and network fault tolerance: they repair and prevent communication and software errors. They also allow checkpointing of applications and duplication of jobs and data to guard against catastrophic hardware failures. However, only preliminary work has been done so far on application resilience, i.e., the ability to resume normal execution following application errors and abnormal executions. This paper gives an overview of open issues and solutions for the detection and management of such errors. It also outlines the implementation of a workflow management system to design, deploy, execute, monitor, restart and resume distributed HPC applications on cloud infrastructures in case of failures.
Main file
ICCSA2012.pdf (435.37 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00700571, version 1 (23-05-2012)

Identifiers

  • HAL Id: hal-00700571, version 1

Cite

Toan Nguyen, Jean-Antoine Desideri. Resilience for Collaborative Applications on Clouds. ICCSA2012 - 12th International Conference on Computational Science and Its Applications, Universidade Federal de Bahia, Jun 2012, Salvador de Bahia, Brazil. pp.418-433. ⟨hal-00700571⟩
104 views
186 downloads
