Which Verification for Soft Error Detection?

Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually detected). The main contribution of this paper is to show which detector(s) to use, and to characterize the optimal computational pattern for the application: how many detectors of each type to use, together with the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a realistic set of evaluation scenarios.
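To make the cost/recall trade-off concrete, here is a minimal Python sketch. It is not the model or the greedy algorithm from the report. It assumes, purely for illustration, that each detector catches a given error independently with probability equal to its recall. The `Detector` class, the detector names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Detector:
    """Hypothetical detector description (not taken from the report)."""
    name: str
    cost: float    # time to run one verification, in arbitrary units
    recall: float  # fraction of errors detected, in [0, 1]


def miss_probability(detectors):
    """Probability that an error escapes every detector in the list.

    Assumption: detections are independent, so the miss probabilities
    (1 - recall) multiply.
    """
    return prod(1.0 - d.recall for d in detectors)


def combined_recall(detectors):
    """Recall of the whole list under the same independence assumption."""
    return 1.0 - miss_probability(detectors)


def total_cost(detectors):
    """Total verification time spent running every detector once."""
    return sum(d.cost for d in detectors)


# Hypothetical values: a cheap partial detector and a costly guaranteed one.
cheap = Detector("cheap_partial", cost=1.0, recall=0.5)
exact = Detector("guaranteed", cost=10.0, recall=1.0)

three_cheap = [cheap] * 3
print(combined_recall(three_cheap))  # 0.875
print(total_cost(three_cheap))       # 3.0
print(combined_recall([exact]))      # 1.0
```

With these made-up numbers, three cheap verifications cost 3 units and reach a combined recall of 0.875, while one guaranteed verification costs 10 units for a recall of 1. Weighing such alternatives, together with the length of each work segment, is the kind of choice the optimization problem in the abstract formalizes.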

Keywords

fault tolerance; high performance computing; silent data corruption; silent error; partial verification; checkpoint; supercomputer; exascale

Domains

Computer Science [cs]; Performance and reliability [cs.PF]; Distributed, parallel, and cluster computing [cs.DC]

Main file

RR-8741.pdf (973.15 KB)

Origin: Files produced by the author(s)

Contributor: Equipe Roma

https://inria.hal.science/hal-01164445

Submitted on: Monday, October 5, 2015, 18:55:26

Last modified on: Tuesday, February 6, 2024, 11:09:05

Long-term archiving on: Wednesday, April 26, 2017, 22:21:54

Dates and versions

hal-01164445, version 1 (16-06-2015)

hal-01164445, version 2 (05-10-2015)

Identifiers

HAL Id: hal-01164445, version 2

Cite

Leonardo Bautista-Gomez, Anne Benoit, Aurélien Cavelan, Saurabh K. Raina, Yves Robert, et al. Which Verification for Soft Error Detection?. [Research Report] RR-8741, INRIA Grenoble; ENS Lyon; Jaypee Institute of Information Technology, India; Argonne National Laboratory; University of Tennessee Knoxville, USA; INRIA. 2015, pp.20. ⟨hal-01164445v2⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL
