Conference papers

A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

Sébastien Monnet 1 Christine Morin 1 Ramamurthy Badrinath 2
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : Code coupling applications are a new kind of application: they can be divided into modules and often need to run on several clusters. The resulting architectures, which we call "cluster federations", contain a large number of nodes, so faults may occur very frequently. A fault tolerance mechanism suited to these architectures and to this kind of application is therefore needed. We propose a hierarchical checkpointing protocol that combines synchronized methods inside clusters with communication-induced methods between clusters. The protocol has been evaluated by discrete event simulation. The first results show that it works well for the targeted applications.
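As an illustration of the two levels named in the abstract, here is a minimal sketch of one common way such a combination can work. It is not the protocol from the paper: it assumes a generic index-based, communication-induced scheme in which each cluster keeps a sequence number, piggybacks it on inter-cluster messages, and forces a cluster-wide checkpoint before delivering a message that carries a higher number. The `Cluster` class, its fields, and its method names are all hypothetical, and the synchronized intra-cluster checkpoint is reduced to a single bookkeeping call.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of a federation (hypothetical model, not taken from the paper)."""

    name: str
    sn: int = 0  # sequence number of the latest cluster-wide checkpoint
    checkpoints: list = field(default_factory=list)  # (sn, reason) pairs
    delivered: list = field(default_factory=list)    # payloads handed to the application

    def coordinated_checkpoint(self, reason, sn=None):
        """Stand-in for a synchronized checkpoint of every node in the cluster."""
        # A voluntary checkpoint advances the number by one; a forced one
        # adopts the (higher) number carried by the incoming message.
        self.sn = self.sn + 1 if sn is None else sn
        self.checkpoints.append((self.sn, reason))
        return self.sn

    def send(self, payload):
        """Inter-cluster messages piggyback the sender's sequence number."""
        return (self.sn, payload)

    def receive(self, message):
        """Communication-induced rule: checkpoint before delivery if the sender is ahead."""
        sender_sn, payload = message
        if sender_sn > self.sn:
            self.coordinated_checkpoint("forced", sn=sender_sn)
        self.delivered.append(payload)


if __name__ == "__main__":
    a, b = Cluster("A"), Cluster("B")
    a.coordinated_checkpoint("timer")   # A checkpoints on its own schedule
    b.receive(a.send("boundary data"))  # B is behind, so it checkpoints first
    b.receive(a.send("more data"))      # same number, no forced checkpoint
    print(a.checkpoints, b.checkpoints, b.delivered)
```

In this sketch, clusters checkpoint independently most of the time and synchronize only when a message crosses a checkpoint boundary, which is the general appeal of communication-induced schemes between loosely coupled sites.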

Contributor : Sébastien Monnet
Submitted on : Tuesday, January 10, 2006 - 8:07:44 PM
Last modification on : Friday, February 4, 2022 - 3:25:36 AM
Long-term archiving on : Monday, September 20, 2010 - 1:34:47 PM


  • HAL Id : inria-00000990, version 2


Sébastien Monnet, Christine Morin, Ramamurthy Badrinath. A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations. 9th IEEE Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems, Apr 2004, Santa Fe, New Mexico, United States. pp.211. ⟨inria-00000990v2⟩


