Energy-aware checkpointing strategies - Archive ouverte HAL
Book Sections Year : 2015

Energy-aware checkpointing strategies

Abstract

Future extreme-scale supercomputers will gather hundreds of millions of cores. The main problem that we address is energy consumption, since such systems will consume an enormous amount of energy. Besides that, we also need to overcome important challenges related to fault tolerance in such extreme-scale systems. Fault-tolerance protocols have different energy consumption depending on parameters such as the platform characteristics, the application features, and the number of processes used in the execution. Currently, the only way to evaluate the power consumption of fault-tolerance protocols in a given execution context is to run the application with the different versions of the protocols and monitor the energy consumption. To avoid this time- and energy-consuming process, we propose in this chapter a methodology to estimate the energy consumption of the fault-tolerance protocols used in High-Performance Computing applications. Our methodology relies on an energy calibration of the supercomputer and a user description of the execution setting. We evaluate the accuracy of the estimations with applications and scenarios executed on a real platform with energy consumption monitoring. Results show that the energy estimations that we are able to provide before the executions are highly accurate and allow users to select the least energy-consuming fault-tolerance protocol without pre-running the application.
Main file
FTHPC2015_Aupy_Benoit_Diouri_Gluck_Lefevre.pdf (605.02 KB)
Origin : Files produced by the author(s)

Dates and versions

hal-01205153 , version 1 (10-12-2019)

Identifiers

  • HAL Id : hal-01205153 , version 1

Cite

Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre. Energy-aware checkpointing strategies. Thomas Hérault; Yves Robert. Fault-Tolerance Techniques for High-Performance Computing, Springer, pp.279-317, 2015. ⟨hal-01205153⟩