Failure Analysis and Modeling in Large Multi-site Infrastructures

Abstract: Every large multi-site infrastructure, such as a Grid or a Cloud, must implement fault-tolerance mechanisms and smart schedulers to enable continuous operation even when resource failures occur. Evaluating the efficiency of such mechanisms and schedulers requires representative failure models that capture realistic properties of real-world failure data. This paper shows that failures in multi-site infrastructures are far from being randomly distributed. We propose a failure model that captures features observed in real failure traces.

https://hal.inria.fr/hal-01489451
Contributor: Hal Ifip
Submitted on: Tuesday, March 14, 2017 - 2:19:15 PM
Last modification on: Thursday, April 18, 2019 - 5:35:09 PM
Long-term archiving on: Thursday, June 15, 2017 - 2:22:46 PM

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Tran Minh, Guillaume Pierre. Failure Analysis and Modeling in Large Multi-site Infrastructures. 13th International Conference on Distributed Applications and Interoperable Systems (DAIS), Jun 2013, Florence, Italy. pp.127-140, ⟨10.1007/978-3-642-38541-4_10⟩. ⟨hal-01489451⟩
