Book sections

Energy-aware checkpointing strategies

Guillaume Aupy 1, 2, 3 Anne Benoit 1, 2, 3 Mohammed El Mehdi Diouri 4, 1, 3 Olivier Glück 4, 1, 3 Laurent Lefèvre 4, 1, 3
Abstract: Future extreme-scale supercomputers will gather hundreds of millions of cores. The main problem that we address is energy consumption, since such systems will consume an enormous amount of energy. Beyond that, we also need to overcome important fault-tolerance challenges in such extreme-scale systems. Fault-tolerance protocols consume different amounts of energy depending on parameters such as the platform characteristics, the application features, and the number of processes used in the execution. Currently, the only way to evaluate the power consumption of fault-tolerance protocols in a given execution context is to run the application with each version of the protocols and monitor the energy consumption. To avoid this time- and energy-consuming process, we propose in this chapter a methodology to estimate the energy consumption of the fault-tolerance protocols used in High-Performance Computing applications. Our methodology relies on an energy calibration of the supercomputer and a user description of the execution setting. We evaluate the accuracy of the estimations with applications and scenarios executed on a real platform with energy consumption monitoring. Results show that the energy estimations we are able to provide before the executions are highly accurate and allow users to select the least energy-consuming fault-tolerance protocol without pre-running the application.
Cited literature: 48 references

https://hal.inria.fr/hal-01205153
Contributor: Laurent Lefèvre
Submitted on: Tuesday, December 10, 2019 - 4:58:47 PM
Last modification on: Monday, May 4, 2020 - 11:38:46 AM
Document(s) archived on: Wednesday, March 11, 2020 - 9:58:02 PM

File

FTHPC2015_Aupy_Benoit_Diouri_G...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01205153, version 1

Citation

Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre. Energy-aware checkpointing strategies. Thomas Hérault; Yves Robert. Fault-Tolerance Techniques for High-Performance Computing, Springer, pp.279-317, 2015. ⟨hal-01205153⟩
