Conference papers

Multi-criteria checkpointing strategies: response-time versus resource utilization

Aurélien Bouteiller 1 Franck Cappello 2, 3, 4, 5 Jack Dongarra 1 Amina Guermouche 6, 7 Thomas Herault 1 Yves Robert 7, 6 
4 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
6 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: Failures increasingly threaten the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while platform efficiency is near-perfect.
Contributor: Equipe Roma
Submitted on: Tuesday, January 14, 2014 - 3:17:50 PM
Last modification on: Friday, November 18, 2022 - 9:25:21 AM
Long-term archiving on: Tuesday, April 15, 2014 - 4:18:47 PM


Aurélien Bouteiller, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Herault, et al. Multi-criteria checkpointing strategies: response-time versus resource utilization. Euro-Par 2013, 2013, Aachen, Germany. pp.420-431, ⟨10.1007/978-3-642-40047-6_43⟩. ⟨hal-00926606⟩


