On Communication Determinism in Parallel HPC Applications

Abstract: Current fault-tolerant protocols for parallel high-performance computing (HPC) applications have two major drawbacks: either they require restarting all processes even when only a single process fails, or they impose a high performance overhead in fault-free execution. As a consequence, none of the existing generic fault-tolerant protocols matches the needs of HPC applications and, surprisingly, there is no fault-tolerant protocol dedicated to them. One way to design better fault-tolerant protocols for HPC applications is to explore and take advantage of their specific characteristics. In particular, we suspect that most of them exhibit some form of determinism in their communication patterns. Communication determinism can play an important role in the design of new fault-tolerant protocols by reducing their complexity. In this paper, we explore communication determinism in 27 parallel HPC applications that are representative of production workloads in large-scale centers. We show that most of these applications have deterministic or send-deterministic communication patterns.
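The abstract does not define its two classes, so the sketch below rests on an assumed reading: an application is "deterministic" if every process sees the same sequence of send and receive events in each run, and "send-deterministic" if only the sequence of sends is the same while receptions may be reordered. The trace format and the names `Event`, `Trace`, `sends_only` and `classify` are hypothetical, chosen for illustration only. Agreement across a finite set of traces is evidence for a class, not a proof.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format (an assumption, not taken from the paper):
# an event is (kind, peer_rank, payload), with kind "send" or "recv".
Event = Tuple[str, int, str]
# One execution: for each process rank, its ordered list of events.
Trace = Dict[int, List[Event]]


def sends_only(trace: Trace) -> Trace:
    """Keep only the send events of each process, preserving their order."""
    return {
        rank: [event for event in events if event[0] == "send"]
        for rank, events in trace.items()
    }


def classify(executions: List[Trace]) -> str:
    """Classify the observed executions under the assumed definitions."""
    first = executions[0]
    # Same sends and receives, in the same order, on every process.
    if all(trace == first for trace in executions[1:]):
        return "deterministic"
    # Receptions may be reordered, but every process sends the same sequence.
    first_sends = sends_only(first)
    if all(sends_only(trace) == first_sends for trace in executions[1:]):
        return "send-deterministic"
    return "non-deterministic"


# Rank 0 receives one value from rank 1 and one from rank 2 in arrival
# order, then sends their sum to rank 3.
run_a: Trace = {0: [("recv", 1, "x"), ("recv", 2, "y"), ("send", 3, "x+y")]}
run_b: Trace = {0: [("recv", 2, "y"), ("recv", 1, "x"), ("send", 3, "x+y")]}
# Here the message sent depends on which reception arrived first.
run_c: Trace = {0: [("recv", 2, "y"), ("recv", 1, "x"), ("send", 3, "y")]}

print(classify([run_a, run_a]))
print(classify([run_a, run_b]))
print(classify([run_a, run_c]))
```

In this toy example, `run_a` and `run_b` differ only in the order of the two receptions on rank 0, so under the assumed definitions the pair counts as send-deterministic but not deterministic.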

https://hal.inria.fr/hal-01953167
Contributor: Amina Guermouche
Submitted on: Wednesday, December 12, 2018 - 4:34:58 PM
Last modification on: Thursday, August 1, 2019 - 2:12:06 PM
Long-term archiving on: Wednesday, March 13, 2019 - 3:43:36 PM

File

icccn2010.pdf
Files produced by the author(s)

Citation

Franck Cappello, Amina Guermouche, Marc Snir. On Communication Determinism in Parallel HPC Applications. 2010 Proceedings of 19th International Conference on Computer Communications and Networks, Aug 2010, Zurich, Switzerland. ⟨10.1109/ICCCN.2010.5560143⟩. ⟨hal-01953167⟩
