
Resilience at extreme scale : system level, algorithmic level or both?

Luc Giraud 1, 2 Franck Cappello 3, 4, 5 
1 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : Resilience is a critical problem for extreme-scale numerical simulations. The most credible solution is still based on checkpoint/restart, with its high overheads or hardware cost. It has recently been shown that some algorithmic approaches and some code characteristics can help reduce these costs through combined system-algorithmic/application approaches. However, we are still looking for the right solution to this simple question: how can state-saving and recovery times be reduced simultaneously and significantly?
Document type :
Conference papers
Contributor : Luc Giraud
Submitted on : Thursday, July 25, 2013 - 12:57:13 PM
Last modification on : Friday, November 18, 2022 - 9:25:23 AM


  • HAL Id : hal-00799309, version 1


Luc Giraud, Franck Cappello. Resilience at extreme scale : system level, algorithmic level or both? SIAM Conference on Computational Science and Engineering (SIAM CSE 2013), Feb 2013, Boston, United States. ⟨hal-00799309⟩


