Book chapter, Year: 2015

Energy-aware checkpointing strategies

Abstract

Future extreme-scale supercomputers will gather hundreds of millions of cores. The main problem we address is energy consumption, since such systems will consume an enormous amount of energy. We also need to overcome important challenges related to fault tolerance in such extreme-scale systems. The energy consumption of fault-tolerance protocols differs depending on parameters such as the platform characteristics, the application features, and the number of processes used in the execution. Currently, the only way to evaluate the power consumption of fault-tolerance protocols in a given execution context is to run the application with each version of the protocols and monitor the energy consumption. To avoid this time- and energy-consuming process, we propose in this chapter a methodology to estimate the energy consumption of the fault-tolerance protocols used in High-Performance Computing applications. Our methodology relies on an energy calibration of the supercomputer and a user description of the execution setting. We evaluate the accuracy of the estimations with applications and scenarios executed on a real platform with energy consumption monitoring. Results show that the energy estimations we are able to provide before the executions are highly accurate and allow users to select the least energy-consuming fault-tolerance protocol without pre-running the application.
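
As a rough illustration of the idea in the abstract (combining calibrated power measurements with a user description of the execution, then comparing protocols), the following minimal sketch may help. It is not the chapter's actual model: the `Phase` record, the function names, and the assumption that energy is simply power times duration summed over phases and scaled by node count are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One fault-tolerance operation (hypothetical record, not from the chapter)."""
    name: str
    power_watts: float  # assumed per-node power, taken from a calibration run
    seconds: float      # assumed duration of one occurrence
    occurrences: int    # how often it happens, from the user's description


def estimate_energy_joules(phases, nodes):
    """Estimate energy as nodes * sum(power * duration * occurrences)."""
    return nodes * sum(p.power_watts * p.seconds * p.occurrences for p in phases)


def least_energy_protocol(protocols, nodes):
    """Return the name of the protocol with the lowest estimated energy."""
    return min(protocols, key=lambda name: estimate_energy_joules(protocols[name], nodes))
```

A caller would fill `protocols` (a dict mapping a protocol name to its list of `Phase` entries) with values measured on their own platform; no figures from the chapter are reproduced here.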
Main file
FTHPC2015_Aupy_Benoit_Diouri_Gluck_Lefevre.pdf (605.02 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01205153, version 1 (10-12-2019)

Identifiers

  • HAL Id: hal-01205153, version 1

Cite

Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre. Energy-aware checkpointing strategies. Thomas Hérault; Yves Robert. Fault-Tolerance Techniques for High-Performance Computing, Springer, pp.279-317, 2015. ⟨hal-01205153⟩