Assessing the Impact of Partial Verifications Against Silent Data Corruptions

Silent errors, or silent data corruptions, constitute a major threat to very large scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with a verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we investigate the use of partial verification mechanisms in addition to a guaranteed verification. The main objective is to determine to what extent it is worthwhile to use a low-cost but less precise type of verification in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, and their optimal positions inside the pattern. Simulations based on a wide range of parameters confirm the benefits of partial verifications in certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.
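The computing pattern described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' model or simulator: the function name, every parameter name, the exponential error arrivals, the assumption that errors strike only during computation, and the equal spacing of partial verifications are all assumptions made for illustration (the report derives the optimal positions analytically, which this sketch does not attempt).

```python
import math
import random


def simulate_pattern(work, n_partial, recall, v_partial, v_guaranteed,
                     checkpoint, recovery, error_rate, rng):
    """Return the time needed to complete one pattern and checkpoint it.

    Hypothetical model (assumptions, not taken from the report):
    - the pattern holds `work` units of computation split into
      n_partial + 1 equal segments;
    - a partial verification (cost v_partial, detects a corruption with
      probability `recall`) follows every segment except the last;
    - a guaranteed verification (cost v_guaranteed, detects every
      corruption) follows the last segment, then a checkpoint is taken;
    - silent errors arrive at rate `error_rate` during computation only;
    - on detection, `recovery` is paid and the pattern restarts.
    """
    segment = work / (n_partial + 1)
    p_error = 1.0 - math.exp(-error_rate * segment)
    elapsed = 0.0
    while True:
        corrupted = False
        detected = False
        for i in range(n_partial + 1):
            elapsed += segment
            if rng.random() < p_error:
                corrupted = True
            if i == n_partial:
                # Guaranteed verification: never misses a corruption.
                elapsed += v_guaranteed
                detected = corrupted
            else:
                # Partial verification: may miss an existing corruption.
                elapsed += v_partial
                detected = corrupted and rng.random() < recall
            if detected:
                break
        if not detected:
            # Data is clean, so it is safe to write the checkpoint.
            return elapsed + checkpoint
        # Roll back to the previous checkpoint and re-execute the pattern.
        elapsed += recovery
```

Averaging the returned time over many runs, for several values of `n_partial`, gives a rough empirical view of when the cheaper partial verifications pay off under these assumed parameters.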


Domains

Computer Science [cs] Performance and reliability [cs.PF]

Main file

RR-8711.pdf (1005.51 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-01143832

Submitted on: Wednesday, June 17, 2015, 14:00:58

Last modified on: Tuesday, February 6, 2024, 11:09:05

Long-term archiving on: Tuesday, April 25, 2017, 11:15:13

Dates and versions

hal-01143832 , version 1 (20-04-2015)

hal-01143832 , version 2 (17-06-2015)

Identifiers

HAL Id : hal-01143832 , version 2

Cite

Aurélien Cavelan, Saurabh K. Raina, Yves Robert, Hongyang Sun. Assessing the Impact of Partial Verifications Against Silent Data Corruptions. [Research Report] RR-8711, INRIA Grenoble - Rhône-Alpes; ENS Lyon; Université Lyon 1; Jaypee Institute of Information Technology, India; CNRS - Lyon (69); University of Tennessee Knoxville, USA; INRIA. 2015. ⟨hal-01143832v2⟩


Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL

133 views

169 downloads