An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures

Christine Morin 1 Anne-Marie Kermarrec 2 Michel Banâtre 3 Alain Gefflaut 4
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
3 SOLIDOR - Design of Distributed Operating Systems
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, INRIA Rennes
Abstract: Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and extends the existing coherence protocol to manage both the data used by processors for computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both the simulation results and the measurements show that our solution is efficient and scalable.
Document type:
Journal article
IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2000, 49 (5), pp.414-430. 〈http://www.computer.org/portal/web/csdl/doi/10.1109/12.859537〉

https://hal.inria.fr/inria-00435092
Contributor: Christine Morin
Submitted on: Monday, November 23, 2009 - 15:32:24
Last modified on: Thursday, January 11, 2018 - 06:20:10
Document(s) archived on: Thursday, June 17, 2010 - 21:29:54

File

t0414.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id: inria-00435092, version 1

Citation

Christine Morin, Anne-Marie Kermarrec, Michel Banâtre, Alain Gefflaut. An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures. IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2000, 49 (5), pp.414-430. 〈http://www.computer.org/portal/web/csdl/doi/10.1109/12.859537〉. 〈inria-00435092〉
