Conference papers

Resilience for Collaborative Applications on Clouds

Toan Nguyen 1, Jean-Antoine Desideri 1
1 OPALE - Optimization and control, numerical algorithms and integration of complex multidiscipline systems governed by PDE
CRISAM - Inria Sophia Antipolis - Méditerranée, JAD - Laboratoire Jean Alexandre Dieudonné: UMR6621
Abstract: Because e-Science applications are data-intensive and require long execution runs, it is important that they feature fault-tolerance mechanisms. Cloud and grid computing infrastructures often support system and network fault tolerance: they prevent and repair communication and software errors. They also allow checkpointing of applications and duplication of jobs and data to guard against catastrophic hardware failures. However, only preliminary work has been done so far on application resilience, i.e., the ability to resume normal execution following application errors and abnormal executions. This paper gives an overview of open issues and solutions for detecting and managing such errors. It also outlines the implementation of a workflow management system to design, deploy, execute, monitor, restart and resume distributed HPC applications on cloud infrastructures in case of failures.

Contributor: Toan Nguyen
Submitted on: Wednesday, May 23, 2012 - 2:36:03 PM
Last modification on: Saturday, June 25, 2022 - 11:07:47 PM
Long-term archiving on: Friday, August 24, 2012 - 2:34:10 AM




  • HAL Id: hal-00700571, version 1



Toan Nguyen, Jean-Antoine Desideri. Resilience for Collaborative Applications on Clouds. ICCSA2012 - 12th International Conference on Computational Science and Its Applications, Universidade Federal de Bahia, Jun 2012, Salvador de Bahia, Brazil. pp. 418-433. ⟨hal-00700571⟩


