
Resilience at extreme scale : system level, algorithmic level or both?

Luc Giraud (1, 2), Franck Cappello (3, 4, 5)
1 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
3 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
Abstract : Resilience is a critical problem for extreme-scale numerical simulations. The most credible solution is still based on checkpoint/restart, with its high overheads or hardware cost. It has recently been shown that some algorithmic approaches and some code characteristics can help reduce these costs through combined system-level and algorithmic/application-level approaches. However, we are still looking for the right answer to this simple question: how can state saving and recovery times be reduced simultaneously and significantly?
Document type : Conference papers
Contributor : Luc Giraud
Submitted on : Thursday, July 25, 2013 - 12:57:13 PM
Last modification on : Thursday, October 7, 2021 - 3:20:12 PM


  • HAL Id : hal-00799309, version 1


Luc Giraud, Franck Cappello. Resilience at extreme scale : system level, algorithmic level or both? SIAM Conference on Computational Science and Engineering (SIAM CSE 2013), Feb 2013, Boston, United States. ⟨hal-00799309⟩


