Hybrid Checkpointing for Parallel Applications in Cluster Federations

Sébastien Monnet (1), Christine Morin (1), Ramamurthy Badrinath (1)
(1) PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : Cluster federations are very useful for applications such as large-scale code coupling. Faults may occur very frequently, so we want to use checkpoints to be able to restart applications. To take into account the constraints introduced by the cluster federation architecture, we propose a hierarchical checkpointing protocol. It uses synchronization inside clusters but only quasi-synchronous methods between clusters. Our protocol has been evaluated by simulation and is well suited to applications that can be divided into modules with many communications inside modules but few between them.
Document type :
Reports

Cited literature : 12 references

https://hal.inria.fr/inria-00071577
Contributor : Rapport de Recherche Inria
Submitted on : Tuesday, May 23, 2006 - 6:01:07 PM
Last modification on : Friday, November 16, 2018 - 1:30:31 AM
Long-term archiving on : Sunday, April 4, 2010 - 10:26:23 PM

Identifiers

  • HAL Id : inria-00071577, version 1

Citation

Sébastien Monnet, Christine Morin, Ramamurthy Badrinath. Hybrid Checkpointing for Parallel Applications in Cluster Federations. [Research Report] RR-5007, INRIA. 2003. ⟨inria-00071577⟩
