Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation

Cristian Ruiz 1, Joseph Emeras 1, Emmanuel Jeanvoine 1, Lucas Nussbaum 1
1 MADYNES - Management of dynamic networks and services
Inria Nancy - Grand Est, LORIA - NSS - Department of Networks, Systems and Services
Abstract : The era of exascale computing raises new challenges for HPC. The intrinsic characteristics of these extreme-scale platforms bring energy and reliability issues. To cope with these constraints, applications will have to become more flexible in order to deal with changes in platform geometry and unavoidable failures. Preparing for this era therefore requires a strong effort to improve the HPC software stack. This work focuses on improving the study of a central part of that stack, the HPC runtimes. To this end, we propose a set of extensions to the Distem emulator that enable the evaluation of fault tolerance and load balancing mechanisms in such runtimes. Extensive experiments showing the benefits of our approach have been performed with three HPC runtimes: Charm++, MPICH, and OpenMPI.
Document type :
Preprints, Working Papers, ...

https://hal.inria.fr/hal-00949762
Contributor : Lucas Nussbaum
Submitted on : Sunday, January 10, 2016 - 9:14:43 PM
Last modification on : Tuesday, February 5, 2019 - 2:46:01 PM
Long-term archiving on : Monday, April 11, 2016 - 10:58:14 AM

File

distem-ft.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00949762, version 2

Citation

Cristian Ruiz, Joseph Emeras, Emmanuel Jeanvoine, Lucas Nussbaum. Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation. 2016. ⟨hal-00949762v2⟩
