inria-00074112, version 1

On Modeling Consistent Checkpoints and the Domino Effect in Distributed Systems

Roberto Baldoni 1, Jean-Michel Hélary a1, Achour Mostefaoui a1, Michel Raynal a1

No. RR-2569 (1995)

Abstract: Backward error recovery is one of the most widely used schemes for ensuring fault tolerance in distributed systems. When a failure occurs, it restores the distributed computation to an error-free global state from which execution can resume and produce correct behavior. Checkpointing is one of the techniques used to implement backward error recovery. In this paper, we present a general framework that takes into account a semantics including missing and orphan messages. The notions of missing and orphan messages are revisited by considering the additional underlying mechanisms available on channels and the semantics of messages. This framework makes it possible, first, to state and prove a theorem that determines whether an arbitrary set of checkpoints is consistent and, second, to define the domino effect formally. Further, we show how previously published uncoordinated checkpointing algorithms can be described in our context, and we give some examples of uncoordinated checkpointing algorithms that ensure domino-free rollback recovery.

  • a –  Université Rennes I
  • 1 :  ADP (INRIA - IRISA)
  • CNRS : UMR6074 – INRIA – Institut National des Sciences Appliquées (INSA) - Rennes – Université de Rennes 1
  • Domain: Computer Science/Other
  • Keywords: CONSISTENT GLOBAL CHECKPOINTS / LAMPORT'S HAPPENED-BEFORE RELATION / ORPHAN AND MISSING MESSAGES / DOMINO EFFECT / UNCOORDINATED CHECKPOINTING ALGORITHMS / FAULT-TOLERANCE / DISTRIBUTED SYSTEMS
  • Internal reference: RR-2569
 
  • oai:hal.inria.fr:inria-00074112
  • Contributor: 
  • Submitted on: Wednesday 24 May 2006, 14:32:55
  • Last modified on: Thursday 8 March 2007, 16:35:14