A Fault-Tolerant Approach to Distributed Applications

Toan Nguyen 1, Jean-Antoine Desideri 1, Laurentiu Trifan 1
1 OPALE - Optimization and control, numerical algorithms and integration of complex multidiscipline systems governed by PDE
CRISAM - Inria Sophia Antipolis - Méditerranée, JAD - Laboratoire Jean Alexandre Dieudonné : UMR6621
Abstract : Distributed computing infrastructures, e.g., grids and clouds, support system and network fault-tolerance. They transparently repair and prevent communication and system software errors. They also allow duplication and migration of jobs and data to guard against hardware failures. However, only limited work has been done so far on application resilience, i.e., the ability to resume normal execution after errors and abnormal executions in distributed environments. This paper addresses issues in application resilience, specifically fault-tolerance to algorithmic errors and to resource allocation failures. It discusses solutions for error detection and management. It also gives an overview of a platform used to deploy, execute, monitor, restart and resume distributed applications on grid and cloud infrastructures in case of unexpected behavior.

https://hal.inria.fr/hal-00823329
Contributor : Toan Nguyen
Submitted on : Thursday, May 16, 2013 - 4:44:25 PM
Last modification on : Thursday, May 3, 2018 - 1:32:55 PM
Long-term archiving on : Saturday, August 17, 2013 - 5:20:10 AM

File

PDPTA2013.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00823329, version 2

Citation

Toan Nguyen, Jean-Antoine Desideri, Laurentiu Trifan. A Fault-Tolerant Approach to Distributed Applications. Parallel and Distributed Processing Techniques and Applications (PDPTA'13), Jul 2013, Las Vegas, United States. ⟨hal-00823329v2⟩
