Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation

Cristian Ruiz 1 Joseph Emeras 1 Emmanuel Jeanvoine 1 Lucas Nussbaum 1
1 MADYNES - Management of dynamic networks and services
Inria Nancy - Grand Est, LORIA - NSS - Department of Networks, Systems and Services
Abstract : The era of Exascale computing raises new challenges for HPC. Intrinsic characteristics of those extreme scale platforms bring energy and reliability issues. To cope with those constraints, applications will have to be more flexible in order to deal with platform geometry evolutions and unavoidable failures. Thus, to prepare for this upcoming era, a strong effort must be made on improving the HPC software stack. This work focuses on improving the study of a central part of the software stack, the HPC runtimes. To this end we propose a set of extensions to the Distem emulator that enable the evaluation of fault tolerance and load balancing mechanisms in such runtimes. Extensive experimentation showing the benefits of our approach has been performed with three HPC runtimes: Charm++, MPICH, and OpenMPI.
Complete list of metadatas

Cited literature [15 references]  Display  Hide  Download

https://hal.inria.fr/hal-00949762
Contributor : Lucas Nussbaum <>
Submitted on : Monday, June 6, 2016 - 4:15:53 PM
Last modification on : Thursday, February 7, 2019 - 5:34:48 PM

Identifiers

  • HAL Id : hal-00949762, version 3

Citation

Cristian Ruiz, Joseph Emeras, Emmanuel Jeanvoine, Lucas Nussbaum. Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation. CCGRID - 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2016, Cartagena, Colombia. ⟨hal-00949762v3⟩

Share

Metrics

Record views

384

Files downloads

257