A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication

Valentin Le Fèvre; Thomas Herault; Julien Langou; Yves Robert

Conference paper, Year: 2020



Valentin Le Fèvre

Role: Author

Optimisation des ressources : modèles, algorithmes et ordonnancement

Laboratoire de l'Informatique du Parallélisme

Thomas Herault

Role: Author

Innovative Computing Laboratory [Knoxville]

Julien Langou

Role: Author

University of Colorado [Denver]

Yves Robert

Role: Author
PersonId: 739318
IdHAL: yves-robert
ORCID: 0000-0003-2361-055X
IdRef: 029813611

Optimisation des ressources : modèles, algorithmes et ordonnancement

Laboratoire de l'Informatique du Parallélisme

Innovative Computing Laboratory [Knoxville]

Abstract

This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a concise overview of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
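
To make the two checksum-based ideas named in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (which, per the abstract, calls optimized BLAS routines). The function names `matmul`, `matvec`, `residual_check` and `abft_locate_and_fix`, the relative tolerance `tol`, and the single-error assumption are all illustrative choices made for this example. Residual checking compares A(Bx) with Cx for a random vector x; the ABFT-style routine compares the row and column sums of C with checksums derived from A and B, then recomputes the one coefficient found at the intersection of the inconsistent row and column.

```python
import random

def matmul(A, B):
    """Plain dense product of two matrices stored as lists of rows."""
    k, m = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(k)) for j in range(m)]
            for row in A]

def matvec(M, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def _close(actual, expected, tol):
    # Relative comparison; tol is an assumed threshold, not taken from the paper.
    return abs(actual - expected) <= tol * max(1.0, abs(expected))

def residual_check(A, B, C, tol=1e-9, seed=0):
    """Residual checking: return True if A(Bx) matches Cx for a random x."""
    rng = random.Random(seed)
    x = [rng.uniform(1.0, 2.0) for _ in range(len(B[0]))]
    lhs = matvec(A, matvec(B, x))   # two matrix-vector products, O(n^2) work
    rhs = matvec(C, x)
    return all(_close(r, l, tol) for l, r in zip(lhs, rhs))

def abft_locate_and_fix(A, B, C, tol=1e-9):
    """ABFT-style check assuming at most one corrupted entry of C.

    Returns None if C is consistent, else the (row, col) that was repaired
    in place by recomputing that single coefficient.
    """
    n, k, m = len(A), len(B), len(B[0])
    # Expected row sums of C: A (B e), with e the all-ones vector.
    row_expected = matvec(A, [sum(row) for row in B])
    # Expected column sums of C: (e^T A) B.
    a_colsum = [sum(A[i][p] for i in range(n)) for p in range(k)]
    col_expected = [sum(a_colsum[p] * B[p][j] for p in range(k))
                    for j in range(m)]
    bad_rows = [i for i in range(n)
                if not _close(sum(C[i]), row_expected[i], tol)]
    bad_cols = [j for j in range(m)
                if not _close(sum(C[i][j] for i in range(n)), col_expected[j], tol)]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single-entry error; recompute the product")
    i, j = bad_rows[0], bad_cols[0]
    # Correction by recomputation of the corrupted coefficient.
    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return (i, j)

# Small demonstration with an injected error.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
C = matmul(A, B)
C[2][1] += 0.5                      # simulate a silent floating-point error
detected = not residual_check(A, B, C)
repaired_at = abft_locate_and_fix(A, B, C)
```

In this sketch both checks cost a few matrix-vector products, which is small next to the product itself. The abstract also mentions correcting by solving a small linear system, which this example does not attempt.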

Domains

Computer Science [cs]

Main file

resilience-europar-hal.pdf (461.42 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-03029309

Submitted on: Monday, November 30, 2020, 09:42:07

Last modified on: Thursday, May 11, 2023, 11:56:10

Long-term archiving on: Monday, March 1, 2021, 18:14:34

Dates and versions

hal-03029309, version 1 (30-11-2020)

Identifiers

HAL Id: hal-03029309, version 1

Cite

Valentin Le Fèvre, Thomas Herault, Julien Langou, Yves Robert. A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. Resilience 2020 - 12th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids (colocated with Euro-Par), Aug 2020, Warsaw, Poland. pp.1-14. ⟨hal-03029309⟩


Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA2 UDL

