A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

Sébastien Monnet (1), Christine Morin (1), Ramamurthy Badrinath (2)
(1) PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract: A new kind of application has emerged: code coupling applications, which can be divided into modules. They often need to run on several clusters. However, the large architectures that we call "cluster federations" contain a large number of nodes, so faults may occur very frequently. A fault tolerance mechanism that fits these architectures and this kind of application should therefore be provided. We propose a hierarchical checkpointing protocol that combines synchronized methods inside clusters with communication-induced methods between clusters. Our protocol has been evaluated by discrete event simulation. The first results show that it works well for the targeted applications.
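
To illustrate the two-level idea in the abstract, the following is a minimal Python sketch and not the authors' implementation. It assumes one common way to combine the two methods: each cluster keeps a sequence number that grows with every synchronized (cluster-wide) checkpoint, messages between clusters piggyback that number, and a receiver takes a forced checkpoint before delivering a message whose number is newer than any it has seen from that sender. The `Cluster` class and all of its fields and methods are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of a federation (hypothetical model, assumed for illustration)."""
    name: str
    sn: int = 0                                     # number of cluster checkpoints taken so far
    last_seen: dict = field(default_factory=dict)   # highest sequence number seen per peer cluster
    log: list = field(default_factory=list)         # trace of (sequence number, reason)

    def coordinated_checkpoint(self, reason: str) -> None:
        # Stand-in for a synchronized checkpoint of every node in the cluster.
        self.sn += 1
        self.log.append((self.sn, reason))

    def send(self, payload: str) -> dict:
        # Inter-cluster messages piggyback the sender's current sequence number.
        return {"src": self.name, "sn": self.sn, "payload": payload}

    def receive(self, msg: dict) -> str:
        # Communication-induced rule (assumed): a sequence number newer than
        # any seen from this sender forces a checkpoint before delivery.
        if msg["sn"] > self.last_seen.get(msg["src"], 0):
            self.coordinated_checkpoint(f"forced by {msg['src']}")
            self.last_seen[msg["src"]] = msg["sn"]
        return msg["payload"]


a, b = Cluster("A"), Cluster("B")
a.coordinated_checkpoint("timer")   # A checkpoints on its own schedule
b.receive(a.send("m1"))             # new sequence number from A: B is forced to checkpoint
b.receive(a.send("m2"))             # same sequence number: delivered without a checkpoint
a.coordinated_checkpoint("timer")
b.receive(a.send("m3"))             # A has checkpointed again: B is forced again
```

In this sketch, clusters never synchronize with each other directly; a checkpoint in one cluster only induces a checkpoint in another when a message actually creates a dependency between them.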

https://hal.inria.fr/inria-00000990
Contributor : Sébastien Monnet
Submitted on : Tuesday, January 10, 2006 - 8:07:44 PM
Last modification on : Friday, November 16, 2018 - 1:24:12 AM
Long-term archiving on : Monday, September 20, 2010 - 1:34:47 PM

Identifiers

  • HAL Id : inria-00000990, version 2

Citation

Sébastien Monnet, Christine Morin, Ramamurthy Badrinath. A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations. 9th IEEE Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems, Apr 2004, Santa Fe, New Mexico, United States. pp.211. ⟨inria-00000990v2⟩
