
Assessing the Impact of Partial Verifications Against Silent Data Corruptions

Abstract: Silent errors, or silent data corruptions, constitute a major threat on very large-scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with a verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we investigate the use of partial verification mechanisms in addition to a guaranteed verification. The main objective is to determine to what extent it is worthwhile to use a low-cost but less precise type of verification in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, and their optimal positions inside the pattern. Simulations based on a wide range of parameters confirm the benefits of partial verifications in certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.
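The pattern described in the abstract (computation interleaved with partial verifications, closed by a guaranteed verification and a checkpoint) can be illustrated with a small Monte Carlo sketch. This is not the authors' simulator or their analytical model. Everything in it is an assumption made for illustration: the function and parameter names, the equal-length segments (the report derives optimal positions, which are not reproduced here), Poisson error arrivals that strike only computation, a partial verification modelled by a single detection probability `recall`, and error-free verifications, checkpoints, and recoveries.

```python
import math
import random


def simulate_pattern(work, n_partial, recall, v_partial, v_guaranteed,
                     checkpoint, recovery, error_rate, rng):
    """Return the time needed to complete one pattern successfully.

    Hypothetical model: `work` is split into n_partial + 1 equal segments.
    Each of the first n_partial segments is followed by a partial
    verification (detects existing corruption with probability `recall`);
    the last segment is followed by a guaranteed verification (always
    detects). A detected error triggers a recovery and a full re-execution.
    """
    segments = n_partial + 1
    seg_len = work / segments
    # Probability of at least one silent error in a segment (Poisson arrivals).
    p_error = 1.0 - math.exp(-error_rate * seg_len)
    elapsed = 0.0
    while True:
        corrupted = False
        detected = False
        for i in range(segments):
            elapsed += seg_len
            if rng.random() < p_error:
                corrupted = True
            if i == segments - 1:
                elapsed += v_guaranteed
                detected = corrupted  # guaranteed verification never misses
            else:
                elapsed += v_partial
                detected = corrupted and rng.random() < recall
            if detected:
                break
        if not detected:
            # Data is verified clean, so it is safe to checkpoint.
            return elapsed + checkpoint
        elapsed += recovery  # roll back to the last checkpoint and retry


def mean_pattern_time(trials, seed, **params):
    """Average simulate_pattern over `trials` runs with a seeded generator."""
    rng = random.Random(seed)
    return sum(simulate_pattern(rng=rng, **params) for _ in range(trials)) / trials
```

Under these assumptions, calling `mean_pattern_time` with different values of `n_partial` and dividing by `work` gives a rough estimate of the overhead per unit of useful work, which is one way to see when cheap partial verifications might pay off against a baseline with `n_partial=0`.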

Cited literature: 28 references
Contributor: Equipe Roma
Submitted on: Wednesday, June 17, 2015 - 2:00:58 PM
Last modification on: Monday, November 16, 2020 - 9:56:04 AM
Long-term archiving on: Tuesday, April 25, 2017 - 11:15:13 AM


Files produced by the author(s)


  • HAL Id: hal-01143832, version 2



Aurélien Cavelan, Saurabh K. Raina, Yves Robert, Hongyang Sun. Assessing the Impact of Partial Verifications Against Silent Data Corruptions. [Research Report] RR-8711, INRIA Grenoble - Rhône-Alpes; ENS Lyon; Université Lyon 1; Jaypee Institute of Information Technology, India; CNRS - Lyon (69); University of Tennessee Knoxville, USA; INRIA. 2015. ⟨hal-01143832v2⟩


