inria-00560582, version 3
Checkpointing strategies for parallel jobs
Marin Bougeret (a, 1), Henri Casanova (2), Mikael Rabie (3), Yves Robert (a, 1, 4), Frédéric Vivien (b, 1)
N° RR-7520 (2011)
Abstract: This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensive simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems. The obtained results not only corroborate our theoretical findings, but also show that our dynamic programming algorithm significantly outperforms previously proposed solutions in the case of Weibull failures. We then discuss results from simulation experiments that use failure logs from production clusters. These results confirm that our dynamic programming algorithm significantly outperforms existing solutions for real-world clusters.
- a – École Normale Supérieure de Lyon
- b – INRIA
- 1: GRAAL (INRIA Grenoble Rhône-Alpes / LIP Laboratoire de l'Informatique du Parallélisme), CNRS : UMR5668 – INRIA – École Normale Supérieure - Lyon – Université Claude Bernard - Lyon I – Laboratoire d'informatique du Parallélisme
- 2: Department of Information and Computer Sciences (ICS Department), University of Hawai`i, Manoa
- 3: École normale supérieure de Lyon (ENS LYON), École Normale Supérieure - Lyon
- 4: Laboratoire de l'Informatique du Parallélisme (LIP), Université de Lyon – CNRS : UMR5668 – INRIA – École Normale Supérieure - Lyon – Université Claude Bernard - Lyon I
- Domain : Computer Science/Distributed, Parallel, and Cluster Computing
- Keywords : Fault-tolerance – checkpointing – sequential job – parallel job – Weibull
- Internal note : RR-7520
- Available versions : v1 (2011-01-29) v2 (2011-04-21) v3 (2011-04-22)
- http://hal.inria.fr/inria-00560582
- oai:hal.inria.fr:inria-00560582
- From: Marin Bougeret
- Submitted on: Friday, 22 April 2011 15:16:41
- Updated on: Tuesday, 26 April 2011 10:58:59





