Resiliency in numerical algorithm design for extreme scale simulations

Emmanuel Agullo; Mirco Altenbernd; Hartwig Anzt; Leonardo Bautista-Gomez; Tommaso Benacchio; Luca Bonaventura; Hans-Joachim Bungartz; Sanjay Chatterjee; Florina M Ciorba; Nathan Debardeleben; Daniel Drzisga; Sebastian Eibl; Christian Engelmann; Wilfried N Gansterer; Luc Giraud; Dominik Göddeke; Marco Heisig; Fabienne Jézéquel; Nils Kohl; Sherry Xiaoye; Romain Lion; Miriam Mehl; Paul Mycek; Michael Obersteiner; Enrique S Quintana-Ortí; Francesco Rizzi; Ulrich Rüde; Martin Schulz; Fred Fung; Robert Speck; Linda Stals; Keita Teranishi; Samuel Thibault; Dominik Thönnes; Andreas Wagner; Barbara Wohlmuth

doi:10.1177/10943420211055188

Journal article: International Journal of High Performance Computing Applications. Year: 2021



Emmanuel Agullo

Role: Author
PersonId : 10278
IdHAL : emmanuel-agullo
ORCID : 0000-0003-0655-6934
IdRef : 150042116

High-End Parallel Algorithms for Challenging Numerical Simulations

Mirco Altenbernd

Role: Author

Universität Stuttgart [Stuttgart]

Hartwig Anzt

Role: Author

Karlsruher Institut für Technologie

Leonardo Bautista-Gomez

Role: Author

Barcelona Supercomputing Center - Centro Nacional de Supercomputacion

Tommaso Benacchio

Role: Author

Politecnico di Milano [Milan]

Luca Bonaventura

Role: Author

Politecnico di Milano [Milan]

Hans-Joachim Bungartz

Role: Author

Technische Universität Munchen - Technical University Munich - Université Technique de Munich

Sanjay Chatterjee

Role: Author

NVIDIA Corporation [Bangalore]

Florina M Ciorba

Role: Author

University Hospital Basel [Basel]

Nathan Debardeleben

Role: Author

Los Alamos National Laboratory

Daniel Drzisga

Role: Author

Technische Universität Munchen - Technical University Munich - Université Technique de Munich

Sebastian Eibl

Role: Author

Friedrich-Alexander Universität Erlangen-Nürnberg = University of Erlangen-Nuremberg

Christian Engelmann

Role: Author

Oak Ridge National Laboratory [Oak Ridge]

Wilfried N Gansterer

Role: Author

University of Vienna [Vienna]

Luc Giraud

Role: Author
PersonId : 8816
IdHAL : luc-giraud
ORCID : 0000-0002-7062-7672
IdRef : 074267418

High-End Parallel Algorithms for Challenging Numerical Simulations

Dominik Göddeke

Role: Author

Universität Stuttgart [Stuttgart]

Marco Heisig

Role: Author

Friedrich-Alexander Universität Erlangen-Nürnberg = University of Erlangen-Nuremberg

Fabienne Jézéquel

Role: Author

Performance et Qualité des Algorithmes Numériques

Université Panthéon-Assas

Nils Kohl

Role: Author

Friedrich-Alexander Universität Erlangen-Nürnberg = University of Erlangen-Nuremberg

Sherry Xiaoye

Role: Author

Lawrence Berkeley National Laboratory [Berkeley]

Romain Lion

Role: Author

Université de Bordeaux

STatic Optimizations, Runtime Methods

Miriam Mehl

Role: Author

Universität Stuttgart [Stuttgart]

Paul Mycek

Role: Author
PersonId : 176988
IdHAL : paul-mycek
ORCID : 0000-0002-6919-112X
IdRef : 175743916

Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique

Michael Obersteiner

Role: Author
PersonId : 757765
ORCID : 0000-0002-5705-1787

Technische Universität Munchen - Technical University Munich - Université Technique de Munich

Enrique S Quintana-Ortí

Role: Author

Universitat Politècnica de València = Universitad Politecnica de Valencia = Polytechnic University of Valencia

Francesco Rizzi

Role: Author

NexGen Analytics

Ulrich Rüde

Role: Author
PersonId : 1093088

Friedrich-Alexander Universität Erlangen-Nürnberg = University of Erlangen-Nuremberg

Martin Schulz

Role: Author

Technische Universität Munchen - Technical University Munich - Université Technique de Munich

Fred Fung

Role: Author

Australian National University

Robert Speck

Role: Author

Jülich Supercomputing Centre

Linda Stals

Role: Author

Australian National University

Keita Teranishi

Role: Author

Sandia National Laboratories - Corporation

Samuel Thibault

Role: Author
PersonId : 8135
IdHAL : samuel-thibault
ORCID : 0000-0001-6411-809X
IdRef : 12476486X

Université de Bordeaux

STatic Optimizations, Runtime Methods

Dominik Thönnes

Role: Author

Friedrich-Alexander Universität Erlangen-Nürnberg = University of Erlangen-Nuremberg

Andreas Wagner

Role: Author

Technische Universität Munchen - Technical University Munich - Université Technique de Munich

Barbara Wohlmuth

Role: Author

Technische Universität Munchen - Technical University Munich - Université Technique de Munich

Abstract

This work is based on the seminar “Resiliency in Numerical Algorithm Design for Extreme Scale Simulations”, held March 1-6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing achieves very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 hours on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications, and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
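
The energy and operation counts quoted in the abstract can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the 48-hour runtime and 20 MW power draw come from the abstract, while the sustained rate of one exaflop per second and the electricity price of 0.10 EUR per kWh are assumptions made here and are not stated in the source.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumptions (NOT from the source): a sustained rate of 1 exaflop/s
# and an electricity price of 0.10 EUR per kWh.

RUNTIME_HOURS = 48.0        # from the abstract
POWER_MW = 20.0             # from the abstract
SUSTAINED_FLOPS = 1e18      # assumed: 1 exaflop/s sustained
PRICE_EUR_PER_KWH = 0.10    # assumed electricity price


def energy_kwh(power_mw: float, hours: float) -> float:
    """Energy in kWh for a constant power draw (1 MW = 1000 kW)."""
    return power_mw * 1000.0 * hours


def total_operations(flops_per_second: float, hours: float) -> float:
    """Floating-point operations executed at a constant sustained rate."""
    return flops_per_second * hours * 3600.0


energy = energy_kwh(POWER_MW, RUNTIME_HOURS)
cost = energy * PRICE_EUR_PER_KWH
operations = total_operations(SUSTAINED_FLOPS, RUNTIME_HOURS)

print(f"energy:     {energy:,.0f} kWh")
print(f"cost:       {cost:,.0f} EUR")
print(f"operations: {operations:.2e}")
```

Under these assumptions the run uses 960,000 kWh (roughly a million kWh), costs about 96,000 EUR, and executes about 1.7 x 10^23 floating-point operations, which matches the orders of magnitude given in the abstract.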

Domains

Numerical analysis [math.NA]; Parallel, distributed, and shared computing [cs.DC]

Main file

2010.13342.pdf (1.28 MB)

Origin: Files produced by the author(s)

Contributor: Luc Giraud

https://inria.hal.science/hal-03348787

Submitted on: Monday, 20 September 2021, 09:10:32

Last modified on: Wednesday, 20 March 2024, 17:52:16

Long-term archiving on: Tuesday, 21 December 2021, 18:13:45

Dates and versions

hal-03348787, version 1 (20-09-2021)

Identifiers

HAL Id: hal-03348787, version 1
arXiv: 2010.13342
DOI: 10.1177/10943420211055188

Cite

Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, et al. Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications, 2021, pp.10943420211055188. ⟨10.1177/10943420211055188⟩. ⟨hal-03348787⟩


Collections

UNIV-RENNES1 CNRS INRIA IRISA LIP6 INRIA2 TDS-MACS UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE SU-SCIENCES UR1-MATH-NUM PANTHEON-ASSAS-UNIVERSITE

