Coping with Recall and Precision of Soft Error Detectors

Abstract: Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost, recall (the fraction of all errors that are actually detected, which accounts for false negatives), and precision (the fraction of true errors among all detected errors, which accounts for false positives). The main contribution of this paper is to characterize the optimal computing pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We first prove that detectors with imperfect precision offer limited usefulness. We then focus on detectors with perfect precision and conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to optimal for a realistic set of evaluation scenarios. Extensive simulations illustrate the usefulness of detectors with false negatives, which are available at a lower cost than guaranteed detectors.
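
The two detector metrics defined in the abstract can be illustrated with a minimal sketch. The function names and the example counts below are hypothetical and do not come from the report; they only restate the definitions of recall and precision in terms of detected errors, missed errors, and false alarms.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all actual errors that the detector caught."""
    total_errors = true_positives + false_negatives
    if total_errors == 0:
        raise ValueError("recall is undefined when no error occurred")
    return true_positives / total_errors


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the detector's alarms that were real errors."""
    total_alarms = true_positives + false_positives
    if total_alarms == 0:
        raise ValueError("precision is undefined when no alarm was raised")
    return true_positives / total_alarms


# Hypothetical counts: 100 real errors, 90 of them detected, plus 30 false alarms.
detected, missed, false_alarms = 90, 10, 30
print("recall:", recall(detected, missed))
print("precision:", precision(detected, false_alarms))
```

Under these definitions, a detector with perfect precision never raises a false alarm (`false_positives == 0`), while a guaranteed detector has a recall of 1.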

https://hal.inria.fr/hal-01246639
Contributor : Equipe Roma
Submitted on : Friday, December 18, 2015 - 7:13:01 PM
Last modification on : Wednesday, February 26, 2020 - 11:14:31 AM
Long-term archiving on : Saturday, April 29, 2017 - 10:39:19 PM

File

RR-8832_extended.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01246639, version 1

Citation

Leonardo Bautista-Gomez, Anne Benoit, Aurélien Cavelan, Saurabh K. Raina, Yves Robert, et al. Coping with Recall and Precision of Soft Error Detectors. [Research Report] RR-8832, ENS Lyon, CNRS & INRIA. 2015, pp.30. ⟨hal-01246639⟩
