COMA: an Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors

Christine Morin 1, Alain Gefflaut 1, Michel Banâtre 1, Anne-Marie Kermarrec 1
1 SOLIDOR - Design of Distributed Operating Systems
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, INRIA Rennes
Abstract: Because of their growing number of components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly if they are to be used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMAs) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMAs by exploiting their standard replication mechanisms and extending the coherence protocol to manage recovery data transparently. The proposed fault-tolerant COMA is evaluated with execution-driven simulations using some of the Splash applications. We show that, for the simulated architecture, the performance degradation caused by the fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. The results also show that the proposed scheme preserves the scalability of the architecture and that the memory overhead remains low for parallel applications using mostly shared data.
Document type:
Conference paper
International Symposium on Computer Architecture, 1996, Philadelphia, United States.

https://hal.inria.fr/inria-00435234
Contributor: Christine Morin
Submitted on: Monday, November 23, 2009 - 21:44:19
Last modified on: Wednesday, April 11, 2018 - 01:54:24
Document(s) archived on: Thursday, June 17, 2010 - 18:27:01

File

MorGefBanKer96isca.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00435234, version 1

Citation

Christine Morin, Alain Gefflaut, Michel Banâtre, Anne-Marie Kermarrec. COMA: an Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors. International Symposium on Computer Architecture, 1996, Philadelphia, United States. 〈inria-00435234〉
