Conference papers

A Fault-Tolerant Approach to Distributed Applications

Toan Nguyen 1 Jean-Antoine Desideri 1 Laurentiu Trifan 1 
1 OPALE - Optimization and control, numerical algorithms and integration of complex multidiscipline systems governed by PDE
CRISAM - Inria Sophia Antipolis - Méditerranée, JAD - Laboratoire Jean Alexandre Dieudonné : UMR6621
Abstract : Distributed computing infrastructures, e.g., grids and clouds, support system and network fault-tolerance. They transparently repair and prevent communication and system software errors. They also allow duplication and migration of jobs and data to guard against hardware failures. However, only limited work has been done so far on application resilience, i.e., the ability to resume normal execution after errors and abnormal executions in distributed environments. This paper addresses issues in application resilience, specifically fault-tolerance to algorithmic errors and to resource allocation failures, and discusses solutions for error detection and management. It also gives an overview of a platform used to deploy, execute, monitor, restart and resume distributed applications on grid and cloud infrastructures in case of unexpected behavior.

Contributor : Toan Nguyen
Submitted on : Thursday, June 13, 2013 - 2:21:25 PM
Last modification on : Friday, August 5, 2022 - 3:50:27 AM
Long-term archiving on : Saturday, September 14, 2013 - 4:13:49 AM


Files produced by the author(s)


  • HAL Id : hal-00823329, version 3


Toan Nguyen, Jean-Antoine Desideri, Laurentiu Trifan. A Fault-Tolerant Approach to Distributed Applications. Parallel and Distributed Processing Techniques and Applications (PDPTA'13), Jul 2013, Las Vegas, United States. ⟨hal-00823329v3⟩


