FAIL-MPI: How fault-tolerant is fault-tolerant MPI ?

Thomas Hérault (1, 2), William Hoarau (1, 2), Pierre Lemarinier (1, 2), Eric Rodriguez (1), Sébastien Tixeuil (1, 2)
1 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
Abstract: The impact of faults is a topic of paramount importance in the development of Cluster and Grid middleware, since the probability that faults occur in a Grid infrastructure or in a large-scale distributed system is very high. MPI (Message Passing Interface) is a popular abstraction for programming distributed computation applications. FAIL is an abstract language for describing fault occurrences, capable of expressing complex and realistic fault scenarios. In this paper, we investigate the possibility of using FAIL to inject faults into a fault-tolerant MPI implementation. Our middleware, FAIL-MPI, is used to carry out quantitative and qualitative fault and stress testing.
https://hal.inria.fr/inria-00078183
Contributor : Sébastien Tixeuil
Submitted on : Saturday, June 3, 2006 - 9:00:16 PM
Last modification on : Thursday, February 21, 2019 - 10:52:50 AM
Long-term archiving on : Tuesday, September 18, 2012 - 2:30:49 PM

Identifiers

  • HAL Id : inria-00078183, version 1

Citation

Thomas Hérault, William Hoarau, Pierre Lemarinier, Eric Rodriguez, Sébastien Tixeuil. FAIL-MPI: How fault-tolerant is fault-tolerant MPI ?. [Research Report] 1450, 2006, pp.26. ⟨inria-00078183⟩
