Book sections

Energy-aware checkpointing strategies

Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre
Abstract: Future extreme-scale supercomputers will gather hundreds of millions of cores. The main problem that we address is energy consumption, since such systems will consume an enormous amount of energy. Besides that, we also need to overcome important challenges related to fault tolerance in such extreme-scale systems. Fault-tolerance protocols have different energy consumption depending on parameters such as the platform characteristics, the application features, and the number of processes used in the execution. Currently, the only way to evaluate the power consumption of fault-tolerance protocols in a given execution context is to run the application with the different versions of the protocols and monitor the energy consumption. To avoid this time- and energy-consuming process, we propose in this chapter a methodology to estimate the energy consumption of the fault-tolerance protocols used in High-Performance Computing applications. Our methodology relies on an energy calibration of the supercomputer and a user description of the execution setting. We evaluate the accuracy of the estimations with applications and scenarios executed on a real platform with energy consumption monitoring. Results show that the energy estimations that we are able to provide before the executions are highly accurate and allow users to select the least energy-consuming fault-tolerance protocol without pre-running the application.

Cited literature: 48 references
Contributor: Laurent Lefèvre
Submitted on: Tuesday, December 10, 2019 - 4:58:47 PM
Last modification on: Thursday, January 20, 2022 - 4:13:59 PM
Long-term archiving on: Wednesday, March 11, 2020 - 9:58:02 PM




  • HAL Id: hal-01205153, version 1



Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre. Energy-aware checkpointing strategies. Thomas Hérault; Yves Robert. Fault-Tolerance Techniques for High-Performance Computing, Springer, pp.279-317, 2015. ⟨hal-01205153⟩


