Conference papers

On Communication Determinism in Parallel HPC Applications

Abstract : Current fault-tolerant protocols for parallel high-performance computing (HPC) applications have two major drawbacks: either they require restarting all processes even when only a single process fails, or they impose a high performance overhead in fault-free execution. As a consequence, none of the existing generic fault-tolerant protocols matches the needs of HPC applications and, surprisingly, there is no fault-tolerant protocol dedicated to them. One way to design better fault-tolerant protocols for HPC applications is to explore and take advantage of their specific characteristics. In particular, we suspect that most of them exhibit some form of determinism in their communication patterns. Communication determinism can play an important role in the design of new fault-tolerant protocols by reducing their complexity. In this paper, we explore communication determinism in 27 parallel HPC applications that are representative of production workloads in large-scale centers. We show that most of these applications have deterministic or send-deterministic communication patterns.

Cited literature [17 references]
Contributor : Amina Guermouche
Submitted on : Wednesday, December 12, 2018 - 4:34:58 PM
Last modification on : Tuesday, October 18, 2022 - 3:35:35 AM
Long-term archiving on : Wednesday, March 13, 2019 - 3:43:36 PM


Files produced by the author(s)



Franck Cappello, Amina Guermouche, Marc Snir. On Communication Determinism in Parallel HPC Applications. 2010 Proceedings of 19th International Conference on Computer Communications and Networks, Aug 2010, Zurich, Switzerland. ⟨10.1109/ICCCN.2010.5560143⟩. ⟨hal-01953167⟩


