On the Combination of Silent Error Detection and Checkpointing

Abstract: In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. In contrast to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal checkpointing period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, so the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
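As a point of reference for the "traditional" strategies the abstract mentions, the sketch below shows the classical first-order trade-off for fail-stop failures only (Young's approximation). It is not taken from the paper and does not model silent errors, detection delays, or verification cost. The function names and the numeric values for the checkpoint cost and platform MTBF are hypothetical, chosen purely for illustration.

```python
import math


def first_order_waste(period, checkpoint_cost, mtbf):
    """First-order waste of periodic checkpointing under fail-stop failures.

    Assumption (classical model, not the paper's): downtime and recovery
    time are ignored, and the period is small compared to the MTBF.
    - checkpoint_cost / period : overhead paid even when no failure occurs
    - period / (2 * mtbf)      : expected re-execution loss, since a failure
                                 strikes on average in the middle of a period
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def young_period(checkpoint_cost, mtbf):
    """Period minimizing first_order_waste: T = sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical parameters: 10-minute checkpoints, one failure per day.
C = 600.0      # checkpoint cost in seconds (assumed)
mu = 86400.0   # platform MTBF in seconds (assumed)

T_opt = young_period(C, mu)
waste_opt = first_order_waste(T_opt, C, mu)
print(f"period = {T_opt:.0f} s, waste = {waste_opt:.3f}")
```

Checkpointing more often than this period inflates the fault-free overhead, while checkpointing less often inflates the expected re-execution time. The abstract's two models revisit this same trade-off once errors are only detected after a delay or through an explicit verification step.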
Document type:
Conference paper
PRDC - The 19th IEEE Pacific Rim International Symposium on Dependable Computing - 2013, Dec 2013, Vancouver, Canada. IEEE, 2013
Cited literature [27 references]

https://hal.inria.fr/hal-00847620
Contributor: Equipe Roma
Submitted on: Wednesday, July 24, 2013 - 09:09:46
Last modified on: Tuesday, January 16, 2018 - 15:36:33
Document(s) archived on: Wednesday, April 5, 2017 - 16:22:30

File

resilience2013.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00847620, version 1

Citation

Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, et al. On the Combination of Silent Error Detection and Checkpointing. PRDC - The 19th IEEE Pacific Rim International Symposium on Dependable Computing - 2013, Dec 2013, Vancouver, Canada. IEEE, 2013. 〈hal-00847620〉
