Resilience at extreme scale : system level, algorithmic level or both?

Luc Giraud 1, 2 Franck Cappello 3, 4, 5
1 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : Resilience is a critical problem for extreme-scale numerical simulations. The most credible solution is still checkpoint/restart, with its high overheads or hardware cost. Recent work has shown that some algorithmic approaches and some code characteristics can help reduce these costs through combined system-level and algorithmic/application-level approaches. However, we are still looking for the right answer to a simple question: how can state-saving and recovery times both be reduced significantly at the same time?
Document type :
Conference papers

https://hal.inria.fr/hal-00799309
Contributor : Luc Giraud
Submitted on : Thursday, July 25, 2013 - 12:57:13 PM
Last modification on : Monday, December 9, 2019 - 5:24:06 PM

Identifiers

  • HAL Id : hal-00799309, version 1

Citation

Luc Giraud, Franck Cappello. Resilience at extreme scale : system level, algorithmic level or both?. SIAM Conference on Computational Science and Engineering (SIAM CSE 2013), Feb 2013, Boston, United States. ⟨hal-00799309⟩
