Checkpointing and Recovery of Shared Memory Parallel Applications in a Cluster

Ramamurthy Badrinath 1, Christine Morin 1, Geoffroy Vallée 1
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : This paper describes issues in the design and implementation of checkpointing and recovery modules for the Kerrighed DSM cluster system. Our design targets a DSM that supports the sequential consistency model. The mechanisms are general enough to be used in a number of different checkpointing and recovery protocols. They are designed to support common performance optimizations suggested in the literature while remaining lightweight during fault-free execution. We also present preliminary performance results for the current implementation.
Document type : Reports

https://hal.inria.fr/inria-00071780
Contributor : Rapport de Recherche Inria
Submitted on : Tuesday, May 23, 2006 - 6:46:03 PM
Last modification on : Friday, November 16, 2018 - 1:23:00 AM
Long-term archiving on : Sunday, April 4, 2010 - 10:37:39 PM

Identifiers

  • HAL Id : inria-00071780, version 1

Citation

Ramamurthy Badrinath, Christine Morin, Geoffroy Vallée. Checkpointing and Recovery of Shared Memory Parallel Applications in a Cluster. [Research Report] RR-4806, INRIA. 2003. ⟨inria-00071780⟩
