inria-00074117, version 1
Consistent Checkpointing in Message Passing Distributed Systems
No. RR-2564 (1995)
Abstract: A global checkpoint of a distributed computation is a set of local checkpoints (local states), one per process. Determining consistent global checkpoints is a very important problem for many distributed applications (e.g. fault tolerance, distributed debugging, property detection). This paper concentrates on such determinations. A precedence relation on checkpoint intervals (the sets of events produced by a process between two successive local checkpoints) is introduced and analyzed. It is shown that a local checkpoint is useless (i.e. it cannot participate in any consistent global checkpoint) iff some pattern appears in this precedence relation. An adaptive checkpointing algorithm is then introduced. Assuming processes take local checkpoints independently, this algorithm requires them to take additional checkpoints (as few as possible) so that none of the previously taken checkpoints is useless. It is based on preventing the previously mentioned pattern. In some sense, this algorithm combines the advantages of both coordinated and uncoordinated checkpointing algorithms without inheriting their drawbacks.
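As an illustration of what "consistent" means for a global checkpoint, here is a minimal Python sketch. It assumes the usual no-orphan-message definition, which the abstract does not spell out: a global checkpoint is consistent if no message is recorded as received by one local checkpoint while its send is not recorded by the sender's local checkpoint. The names `Message`, `is_consistent`, and the event-index encoding are hypothetical and are not taken from the report; the sketch does not implement the report's precedence relation or its adaptive algorithm.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message (not the report's notation)."""
    sender: str
    send_index: int   # position of the send event in the sender's event sequence
    receiver: str
    recv_index: int   # position of the receive event in the receiver's event sequence


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Check a global checkpoint under the assumed no-orphan definition.

    cut[p] is the number of events of process p that precede its local
    checkpoint, so an event with index i is recorded iff i < cut[p].
    """
    for m in messages:
        received_inside = m.recv_index < cut[m.receiver]
        sent_inside = m.send_index < cut[m.sender]
        if received_inside and not sent_inside:
            return False  # orphan message: received but never sent
    return True


# One message: P sends at its event 2, Q receives at its event 1.
msgs = [Message(sender="P", send_index=2, receiver="Q", recv_index=1)]

orphaned = is_consistent({"P": 2, "Q": 2}, msgs)      # receive recorded, send not
both_inside = is_consistent({"P": 3, "Q": 2}, msgs)   # send and receive recorded
in_transit = is_consistent({"P": 3, "Q": 1}, msgs)    # send recorded, receive not
```

In this encoding a message that is sent inside the checkpoint but not yet received (in transit) does not break consistency; only the reverse situation does.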
- a – Université Rennes I
- 1 :
- CNRS : UMR6074 – INRIA – Institut National des Sciences Appliquées (INSA) - Rennes – Université de Rennes 1
- Domain: Computer Science/Other
- Keywords: CONSISTENT GLOBAL CHECKPOINTS / MESSAGE COMMUNICATION SYSTEMS / ADAPTIVE CHECKPOINTING / CAUSALITY / HAPPENED-BEFORE RELATION
- Internal reference: RR-2564
- http://hal.inria.fr/inria-00074117
- oai:hal.inria.fr:inria-00074117
- Contributor:
- Submitted on: Wednesday 24 May 2006, 14:33:23
- Last modified on: Thursday 8 March 2007, 16:35:52




