Failure Analysis and Modeling in Large Multi-Site Infrastructures

Minh Tran Ngoc (1), Guillaume Pierre (1)
(1) MYRIADS - Design and Implementation of Autonomous Distributed Systems
IRISA-D1 - SYSTÈMES LARGE ÉCHELLE, Inria Rennes – Bretagne Atlantique
Abstract : Large multi-site infrastructures such as Grids and Clouds must implement fault-tolerance mechanisms and smart schedulers to keep operating when resources fail. Evaluating the efficiency of such mechanisms and schedulers requires representative failure models that capture the properties of real-world failure data. This paper shows that failures in multi-site infrastructures are far from randomly distributed. We propose a failure model that captures features observed in real failure traces.
Document type :
Conference papers

https://hal.inria.fr/hal-00804747
Contributor : Guillaume Pierre
Submitted on : Tuesday, March 26, 2013 - 11:26:55 AM
Last modification on : Thursday, October 3, 2019 - 10:28:05 AM
Long-term archiving on : Sunday, April 2, 2017 - 8:28:23 PM

File

paper_45.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00804747, version 1

Citation

Minh Tran Ngoc, Guillaume Pierre. Failure Analysis and Modeling in Large Multi-Site Infrastructures. 13th International IFIP Conference on Distributed Applications and Interoperable Systems, IFIP, Jun 2013, Florence, Italy. ⟨hal-00804747⟩
