
Common Mechanisms for Supporting Fault Tolerance in DSM and Message Passing Systems

Ramamurthy Badrinath 1 Christine Morin 1
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : Backward error recovery, which involves checkpointing and restarting tasks, is an important component of any system that provides fault tolerance to applications distributed over a network. A central problem in checkpointing and recovery is tracking dependencies and arriving at a consistent global checkpoint. The literature has traditionally treated either distributed shared memory (DSM) or message passing, but not both, as the interprocess communication (IPC) mechanism when considering fault tolerance. This paper describes a preliminary investigation into common mechanisms that can be implemented to support a wide variety of protocols in both shared memory and message passing systems. In effect, these mechanisms can be used in a system that combines both IPC mechanisms.

Contributor : Rapport de Recherche Inria
Submitted on : Tuesday, May 23, 2006 - 7:23:22 PM
Last modification on : Friday, February 4, 2022 - 3:25:18 AM
Long-term archiving on : Sunday, April 4, 2010 - 10:47:02 PM


  • HAL Id : inria-00071972, version 1


Ramamurthy Badrinath, Christine Morin. Common Mechanisms for Supporting Fault Tolerance in DSM and Message Passing Systems. [Research Report] RR-4613, INRIA. 2002. ⟨inria-00071972⟩


