A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication

Abstract : This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We give a concise overview of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
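
To make the ideas named in the abstract concrete, here is a minimal Python sketch of checksum-based detection (in the spirit of ABFT), one common form of residual checking, and correction by recomputing the flagged coefficient. It is an illustration under stated assumptions, not the paper's implementation: the authors report using optimized BLAS routines, whereas this sketch uses plain Python loops. The function names, the test vector w and the tolerance tol are all hypothetical choices made for this example.

```python
def matmul(A, B):
    """Plain triple-loop product; a stand-in for an optimized BLAS GEMM call."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_locate(A, B, C, tol=1e-9):
    """Checksum test: compare row and column sums of C with checksums
    derived from A and B. Returns (bad_rows, bad_cols)."""
    n, k, m = len(A), len(B), len(B[0])
    col_sum_A = [sum(A[i][p] for i in range(n)) for p in range(k)]  # e^T A
    row_sum_B = [sum(B[p][j] for j in range(m)) for p in range(k)]  # B e
    # Expected column sums of C are (e^T A) B; expected row sums are A (B e).
    exp_col = [sum(col_sum_A[p] * B[p][j] for p in range(k)) for j in range(m)]
    exp_row = [sum(A[i][p] * row_sum_B[p] for p in range(k)) for i in range(n)]
    bad_rows = [i for i in range(n) if abs(sum(C[i]) - exp_row[i]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(C[i][j] for i in range(n)) - exp_col[j]) > tol]
    return bad_rows, bad_cols


def residual_check(A, B, C, w, tol=1e-9):
    """Residual test: rows i where (C w)_i differs from (A (B w))_i."""
    Bw = [sum(B[p][j] * w[j] for j in range(len(w))) for p in range(len(B))]
    ABw = [sum(A[i][p] * Bw[p] for p in range(len(Bw))) for i in range(len(A))]
    Cw = [sum(C[i][j] * w[j] for j in range(len(w))) for i in range(len(C))]
    return [i for i in range(len(A)) if abs(Cw[i] - ABw[i]) > tol]


def correct_by_recompute(A, B, C, bad_rows, bad_cols):
    """Recompute every coefficient at the intersection of flagged rows and columns."""
    for i in bad_rows:
        for j in bad_cols:
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(len(B)))
    return C


# Small exact example: values are chosen so float arithmetic has no rounding.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
C_ref = matmul(A, B)
C = [row[:] for row in C_ref]
C[1][2] += 100.0                       # inject a single silent error

rows, cols = abft_locate(A, B, C)      # checksum test flags row 1, column 2
rc_rows = residual_check(A, B, C, w=[1.0, 2.0, 3.0])  # residual flags row 1
C_fixed = correct_by_recompute(A, B, C, rows, cols)
```

In this sketch the two checksum vectors pinpoint the corrupted entry at the intersection of the flagged row and column, while a single residual vector only identifies the row. With real floating-point data the tolerance would have to account for rounding error, which the fixed tol used here does not attempt to model.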
Document type : Conference papers
https://hal.inria.fr/hal-03029309
Contributor : Equipe Roma
Submitted on : Monday, November 30, 2020 - 9:42:07 AM
Last modification on : Thursday, December 3, 2020 - 1:50:25 PM

File

resilience-europar-hal.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03029309, version 1

Citation

Valentin Le Fèvre, Thomas Herault, Julien Langou, Yves Robert. A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. Resilience 2020 - 12th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids (colocated with Euro-Par), Aug 2020, Warsaw, Poland. pp.1-14. ⟨hal-03029309⟩
