Resilience at extreme scale: system level, algorithmic level or both?

Luc Giraud (1, 2), Franck Cappello (3, 4, 5)
1 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: Resilience is a critical problem for extreme-scale numerical simulations. The most credible solution is still based on checkpoint/restart, with its high overheads or hardware cost. It has been shown recently that some algorithmic approaches and some code characteristics can help reduce these costs through combined system-algorithmic/application approaches. However, we are still looking for the right solution to this simple question: how can state-saving and recovery times be reduced simultaneously and significantly?
Document type:
Conference paper
SIAM Conference on Computational Science and Engineering (SIAM CSE 2013), Feb 2013, Boston, United States. 2013

https://hal.inria.fr/hal-00799309
Contributor: Luc Giraud
Submitted on: Thursday, 25 July 2013 - 12:57:13
Last modified on: Thursday, 5 April 2018 - 12:30:12

Identifiers

  • HAL Id: hal-00799309, version 1

Citation

Luc Giraud, Franck Cappello. Resilience at extreme scale: system level, algorithmic level or both? SIAM Conference on Computational Science and Engineering (SIAM CSE 2013), Feb 2013, Boston, United States. 2013. 〈hal-00799309〉
