ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions

Mohammed El Mehdi Diouri 1, 2; Olivier Glück 1, 2; Laurent Lefèvre 1, 2, 3; Franck Cappello 4, 5, 6
2 AVALON - Algorithms and Software Architectures for Distributed and HPC Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
4 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: Energy consumption and fault tolerance are two interrelated issues that must be addressed when designing future exascale systems. The fault tolerance protocols used for checkpointing consume different amounts of energy depending on parameters such as application features, the number of processes in the execution, and platform characteristics. Currently, the only way to select a protocol for a given execution is to run the application and monitor the energy consumption of the different fault tolerance protocols, and this must be repeated for every variation of the execution setting. To avoid this time- and energy-consuming process, we propose an energy estimation framework. It relies on an energy calibration of the considered platform and a user description of the execution setting. We evaluate the accuracy of our estimations with real applications running on a real platform with energy consumption monitoring. Results show that our estimations are highly accurate and allow selecting the best fault tolerance protocol without pre-executing the application.
Document type:
Conference paper
13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2013, Delft, Netherlands. 2013

https://hal.inria.fr/hal-00806500
Contributor: Mohammed El Mehdi Diouri
Submitted on: Sunday, March 31, 2013 - 22:40:56
Last modified on: Friday, April 20, 2018 - 15:44:26

Identifiers

  • HAL Id: hal-00806500, version 1

Citation

Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre, Franck Cappello. ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions. 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2013, Delft, Netherlands. 2013. 〈hal-00806500〉
