Transparent Message-Passing Parallel Applications Checkpointing in Kerrighed

Matthieu Fertré 1, Christine Morin 1
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : Clusters are now widely used to run scientific applications, which are often message-passing parallel applications with long execution times. As the number of nodes in a cluster grows, the probability of a node failure during the execution of an application increases, and the application execution time may exceed the cluster mean time between failures (MTBF). To avoid restarting an application from the beginning, fault tolerance mechanisms such as checkpoint/restart are needed. Currently, checkpoint/restart mechanisms are either implemented directly in the application source code by application programmers or integrated into communication environments such as MPI or PVM. In this paper, we propose a new approach in which checkpoint/restart mechanisms for parallel applications are implemented in a cluster single system image operating system. While this kernel-level approach is more complex to implement than the others, it is more general: it requires no modification, recompilation, or relinking of applications, whatever communication environment they rely on. Our approach has been implemented in Kerrighed, a single system image operating system based on Linux. Performance results are presented in this paper.
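The abstract's argument about cluster size and MTBF can be illustrated with a small calculation. The sketch below is not taken from the report: it assumes independent node failures with exponentially distributed times to failure, and the per-node MTBF and run length are hypothetical values chosen only for illustration.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Cluster MTBF, assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / nodes


def failure_probability(run_hours: float, node_mtbf_hours: float, nodes: int) -> float:
    """Probability that at least one node fails during a run of run_hours."""
    return 1.0 - math.exp(-run_hours * nodes / node_mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 10_000.0  # hypothetical per-node MTBF, in hours
    run = 100.0           # hypothetical application run time, in hours
    for n in (1, 16, 128, 1024):
        mtbf = cluster_mtbf_hours(node_mtbf, n)
        p = failure_probability(run, node_mtbf, n)
        print(f"{n:5d} nodes: cluster MTBF {mtbf:10.1f} h, P(failure during run) {p:.3f}")
```

Under these assumptions the cluster MTBF shrinks in proportion to the node count, so a run that is short relative to one node's MTBF can still exceed the MTBF of a large cluster. That is the situation in which checkpoint/restart becomes necessary.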
Document type :
Reports

https://hal.inria.fr/inria-00070265
Contributor : Rapport de Recherche Inria
Submitted on : Friday, May 19, 2006 - 7:46:56 PM
Last modification on : Friday, November 16, 2018 - 1:27:55 AM
Long-term archiving on : Sunday, April 4, 2010 - 8:46:38 PM

Identifiers

  • HAL Id : inria-00070265, version 1

Citation

Matthieu Fertré, Christine Morin. Transparent Message-Passing Parallel Applications Checkpointing in Kerrighed. [Research Report] RR-5755, INRIA. 2005, pp.13. ⟨inria-00070265⟩
