Assuming failure independence: are we right to be wrong?

Guillaume Aupy 1 Yves Robert 2, 3 Frédéric Vivien 2
1 TADAAM - Topology-Aware System-Scale Data Management for High-Performance Computing
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
2 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : This report revisits the hypothesis of temporal failure independence, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether failure logs from public archives contain failure cascades. Then we design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, not only do just a few logs contain cascades, but we also show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!
Document type :
Reports

https://hal.inria.fr/hal-01556292
Contributor : Equipe Roma
Submitted on : Tuesday, July 4, 2017 - 9:45:54 PM
Last modification on : Monday, November 16, 2020 - 9:56:04 AM
Long-term archiving on : Friday, December 15, 2017 - 3:19:28 AM

File

RR-9078.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01556292, version 1

Citation

Guillaume Aupy, Yves Robert, Frédéric Vivien. Assuming failure independence: are we right to be wrong? [Research Report] RR-9078, Inria. 2017. ⟨hal-01556292⟩
