Assessing the impact of partial verifications against silent data corruptions

Abstract: Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use some low-cost but less accurate verifications in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to a first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, and their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of using partial verifications under certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.
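
The computing pattern described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' analytical model, only a toy simulation under assumptions that the abstract does not state: silent errors arrive as a Poisson process with rate `lam` during computation only, the work `W` is split into `m` equal-length segments (the paper instead derives optimal positions), each partial verification costs `V` and detects an existing corruption with probability `r`, the final guaranteed verification costs `V_star` and always detects, and a detected error costs a recovery `R` before the pattern restarts. All of these names and the function names are hypothetical.

```python
import math
import random


def simulate_pattern(W, m, V, V_star, C, R, lam, r, rng):
    """Return the wall-clock time needed to complete one pattern successfully.

    Assumed toy model: errors strike only during computation, segments have
    equal length, and verifications, checkpoints and recoveries are error-free.
    """
    elapsed = 0.0
    seg = W / m  # equal-length segments (an assumption of this sketch)
    while True:
        corrupted = False
        detected = False
        for i in range(m):
            elapsed += seg
            # Probability that at least one silent error strikes this segment.
            if rng.random() < 1.0 - math.exp(-lam * seg):
                corrupted = True
            last = (i == m - 1)
            # Guaranteed verification closes the pattern; partial ones sit inside.
            elapsed += V_star if last else V
            recall = 1.0 if last else r
            if corrupted and rng.random() < recall:
                detected = True
                break
        if not detected:
            # Data verified clean by the guaranteed verification: checkpoint.
            return elapsed + C
        # Error detected: roll back to the last checkpoint and retry the pattern.
        elapsed += R


def mean_overhead(W, m, V, V_star, C, R, lam, r, trials=10000, seed=0):
    """Estimate the expected overhead (extra time per unit of useful work)."""
    rng = random.Random(seed)
    total = sum(
        simulate_pattern(W, m, V, V_star, C, R, lam, r, rng)
        for _ in range(trials)
    )
    return total / (trials * W) - 1.0
```

Comparing `mean_overhead(..., m=1, ...)` (guaranteed verification only) against larger values of `m` gives a rough feel for when cheap but imperfect verifications pay off under these assumed parameters. It does not reproduce the paper's first-order analysis.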
Document type:
Conference paper
ICPP'2015, The Int. Conference on Parallel Processing, 2015, Beijing, China. IEEE, pp.440-449 44th International Conference on Parallel Processing - ICPP2015. 〈10.1109/ICPP.2015.53〉

https://hal.inria.fr/hal-01253493
Contributor: Equipe Roma
Submitted on: Sunday, January 10, 2016 - 18:27:11
Last modified on: Friday, April 20, 2018 - 15:44:27

Citation

Aurélien Cavelan, Saurabh Raina, Yves Robert, Hongyang Sun. Assessing the impact of partial verifications against silent data corruptions. ICPP'2015, The Int. Conference on Parallel Processing, 2015, Beijing, China. IEEE, pp.440-449 44th International Conference on Parallel Processing - ICPP2015. 〈10.1109/ICPP.2015.53〉. 〈hal-01253493〉
