Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures

Rafael Ferreira da Silva 1,*, Tristan Glatard 1, Frédéric Desprez 2
* Corresponding author
1 Images et Modèles
CREATIS - Centre de Recherche en Acquisition et Traitement de l'Image pour la Santé
Abstract : Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This report presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online and they make few assumptions about the application or resource characteristics. From their degree, incidents are classified into levels and associated with sets of healing actions that are selected based on association rules modeling correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.

https://hal.inria.fr/hal-00720369
Contributor : Frédéric Desprez
Submitted on : Tuesday, July 24, 2012 - 1:34:15 PM
Last modification on : Friday, October 26, 2018 - 10:46:54 AM
Long-term archiving on : Friday, December 16, 2016 - 2:38:52 AM

Files

RR-8022.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00720369, version 1

Citation

Rafael Ferreira da Silva, Tristan Glatard, Frédéric Desprez. Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures. [Research Report] RR-8022, INRIA. 2012, pp.24. ⟨hal-00720369⟩
