Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation

Cristian Ruiz (1), Joseph Emeras (1), Emmanuel Jeanvoine (1), Lucas Nussbaum (1)
(1) MADYNES - Management of dynamic networks and services, Inria Nancy - Grand Est, LORIA - NSS - Department of Networks, Systems and Services
Abstract: The era of exascale computing raises new challenges for HPC. The intrinsic characteristics of these extreme-scale platforms bring energy and reliability issues. To cope with these constraints, applications will have to become more flexible in order to deal with changes in platform geometry and unavoidable failures. Preparing for this era therefore requires a strong effort to improve the HPC software stack. This work focuses on improving the study of a central part of that stack, the HPC runtimes. To this end, we propose a set of extensions to the Distem emulator that enable the evaluation of fault tolerance and load balancing mechanisms in such runtimes. Extensive experiments showing the benefits of our approach were performed with three HPC runtimes: Charm++, MPICH, and OpenMPI.
Document type: Conference paper
CCGRID - 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2016, Cartagena, Colombia
Cited literature: 15 references

https://hal.inria.fr/hal-00949762
Contributor: Lucas Nussbaum
Submitted on: Monday, 6 June 2016 - 16:15:53
Last modified on: Thursday, 11 January 2018 - 06:25:23

Identifiers

  • HAL Id: hal-00949762, version 3

Citation

Cristian Ruiz, Joseph Emeras, Emmanuel Jeanvoine, Lucas Nussbaum. Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation. CCGRID - 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2016, Cartagena, Colombia. 〈hal-00949762v3〉
