Assuming failure independence: are we right to be wrong?

This paper revisits the failure temporal independence hypothesis, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. We then design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, a few logs do contain cascades, but we show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

fts2017.pdf (597.12 KB)

Origin: Files produced by the author(s)

Contributor: Guillaume Pallez (Aupy)

https://inria.hal.science/hal-01654639

Submitted on: Monday, December 4, 2017, 11:17:09

Last modified on: Thursday, May 11, 2023, 11:56:11

Dates and versions

hal-01654639, version 1 (04-12-2017)

Identifiers

HAL Id: hal-01654639, version 1
DOI: 10.1109/CLUSTER.2017.24

Cite

Guillaume Aupy, Yves Robert, Frédéric Vivien. Assuming failure independence: are we right to be wrong? FTS 2017 - 3rd International Workshop on Fault-Tolerant Systems, Sep 2017, Honolulu (HI), United States. pp. 1-8, ⟨10.1109/CLUSTER.2017.24⟩. ⟨hal-01654639⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA2 UDL