A distributed and replicated service for checkpoint storage

Fatiha Bouabache; Thomas Herault; Gilles Fedak; Franck Cappello

doi:10.1007/978-0-387-78448-9_24

Book chapter, Year: 2008

Fatiha Bouabache

Role: Author
PersonId : 924108

Laboratoire de Recherche en Informatique

Joint Laboratory for Petascale Computing [Illinois]

Thomas Herault

Role: Author
PersonId : 833735

Joint Laboratory for Petascale Computing [Illinois]

Global parallel and distributed computing

Gilles Fedak

Role: Author
PersonId : 2289
IdHAL : gilles-fedak
IdRef : 076982327

Joint Laboratory for Petascale Computing [Illinois]

Algorithms and Software Architectures for Distributed and HPC Platforms

Franck Cappello

Role: Author
PersonId : 828491

Laboratoire de Recherche en Informatique

Global parallel and distributed computing

Abstract

As High Performance platforms (clusters, grids, etc.) continue to grow in size, the mean time between failures (MTBF) decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. It is thus not safe to rely on the high MTBF of specific machines to store the checkpoint images. This paper introduces a new protocol that ensures the reliability of checkpoint storage even if one or more checkpoint servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to decide which one should be used in the implementation.

Domains

Other [cs.OH]

https://inria.hal.science/hal-00689921

Submitted on: Friday, April 20, 2012, 15:21:24

Last modified on: Monday, February 12, 2024, 10:42:04

Dates and versions

hal-00689921 , version 1 (20-04-2012)

Identifiers

HAL Id : hal-00689921 , version 1
DOI : 10.1007/978-0-387-78448-9_24

Cite

Fatiha Bouabache, Thomas Herault, Gilles Fedak, Franck Cappello. A distributed and replicated service for checkpoint storage. In M. Danelutto, P. Fragopoulou, and V. Getov (eds.), Making Grids Work, Springer, pp. 295-306, 2008. ⟨10.1007/978-0-387-78448-9_24⟩. ⟨hal-00689921⟩

Collections

ENS-LYON EC-PARIS UNIV-RENNES1 UNIV-LILLE3 CNRS INRIA UNIV-LYON1 IRISA UMR8623 INRIA2 UR1-MATH-STIC UNIV-PARIS-SACLAY UR1-UFR-ISTIC UNIV-RENNES UDL UR1-MATH-NUM
