
A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

Sébastien Monnet 1 Christine Morin 1 Ramamurthy Badrinath 1
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. Because a cluster federation comprises a large number of nodes, the probability of a node failure is high. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (a large number of nodes, and high-latency, low-bandwidth networking technologies between clusters). A preliminary performance evaluation, performed using a discrete event simulator, shows that the protocol is suitable for code coupling applications.
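
The two-level idea in the abstract can be illustrated with a minimal sketch. It is not the report's protocol: the `Cluster` class, the sequence number `sn`, the `known_sn` table, and the rule "force a checkpoint when a message carries a sender sequence number not seen before" are all assumptions chosen for illustration. Inside a cluster, all node states are saved together under one sequence number (the synchronized part). Between clusters, each message piggybacks the sender's sequence number, and the receiver may be forced to checkpoint before delivery (the communication-induced part).

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of the federation (hypothetical model, not the report's code)."""
    name: str
    node_states: dict[str, int]
    sn: int = 0  # sequence number of the latest cluster-wide checkpoint
    known_sn: dict[str, int] = field(default_factory=dict)  # last SN seen per peer cluster
    checkpoints: list[tuple[int, dict[str, int]]] = field(default_factory=list)

    def coordinated_checkpoint(self) -> None:
        """Synchronized step: save every node's state under one new SN."""
        self.sn += 1
        self.checkpoints.append((self.sn, dict(self.node_states)))

    def send(self, dest: "Cluster", payload: int, node: str) -> None:
        """Inter-cluster send: piggyback the sender cluster's current SN."""
        dest.receive(self.name, self.sn, payload, node)

    def receive(self, src: str, src_sn: int, payload: int, node: str) -> None:
        """Communication-induced step (assumed rule): checkpoint before delivery
        when the message reveals a sender checkpoint not seen before."""
        if src_sn > self.known_sn.get(src, 0):
            self.coordinated_checkpoint()
            self.known_sn[src] = src_sn
        self.node_states[node] += payload


# Small demonstration with two clusters.
a = Cluster("A", {"a0": 0, "a1": 0})
b = Cluster("B", {"b0": 0})
a.coordinated_checkpoint()  # A takes a cluster-wide checkpoint, its SN becomes 1
a.send(b, 5, "b0")          # B sees SN 1 from A for the first time: forced checkpoint
a.send(b, 7, "b0")          # same SN as before: delivered without a new checkpoint
print(b.sn, b.node_states, b.checkpoints)
```

In this sketch only the first message after a new checkpoint at the sender forces a checkpoint at the receiver, so the inter-cluster links carry a single piggybacked integer per message rather than synchronization traffic, which is the kind of property the abstract asks for on high-latency, low-bandwidth links.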

Contributor : Rapport de Recherche Inria
Submitted on : Tuesday, May 23, 2006 - 5:44:23 PM
Last modification on : Friday, February 4, 2022 - 3:25:19 AM
Long-term archiving on : Sunday, April 4, 2010 - 10:19:17 PM


  • HAL Id : inria-00071492, version 1


Sébastien Monnet, Christine Morin, Ramamurthy Badrinath. A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations. [Research Report] RR-5091, INRIA. 2004. ⟨inria-00071492⟩


