An architecture for tolerating processor failures in shared-memory multiprocessors

Abstract : In this paper, we focus on the problem of recovering from processor failures in shared-memory multiprocessors. We propose an architecture designed to tolerate processor failures transparently. The recoverable shared memory (RSM) is the main component of this architecture and provides a hardware-supported backward error recovery mechanism. This technique works with standard caches and cache coherence protocols and avoids rollback propagation. The performance of the architecture during normal execution is evaluated and compared with that of existing fault-tolerant shared-memory multiprocessors. The performance study was conducted by simulation using address traces collected from real parallel applications.
Document type : Reports

https://hal.inria.fr/inria-00074708
Contributor : Rapport de Recherche Inria
Submitted on : Wednesday, May 24, 2006 - 4:06:56 PM
Last modification on : Friday, November 16, 2018 - 1:28:30 AM
Long-term archiving on : Monday, April 5, 2010 - 12:13:47 AM

Identifiers

  • HAL Id : inria-00074708, version 1

Citation

Michel Banâtre, Alain Gefflaut, Philippe Joubert, Peter Lee, Christine Morin. An architecture for tolerating processor failures in shared-memory multiprocessors. [Research Report] RR-1965, INRIA. 1993. ⟨inria-00074708⟩
