Dynamic Placement of Progress Thread for Overlapping MPI Non-Blocking Collectives on Manycore Processor

Abstract : To amortize the cost of MPI collective operations, non-blocking collectives have been proposed to allow communication to be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. To address these issues, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores, and the long but narrow part of the tree on one or several communication cores, so as to reach a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented it in the MPC framework, which is a thread-based MPI implementation. We ran benchmarks on manycore processors such as KNL and Skylake and obtained good results for both performance and overlap.
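
The idea of cutting a tree-based collective into two parts can be pictured with a small sketch. The code below is not the algorithm of the report: it only builds the schedule of a textbook binomial-tree broadcast rooted at rank 0 and cuts it at a given round. The function names `binomial_bcast_rounds` and `split_rounds`, the cutoff parameter `n_first_rounds`, and the choice to cut by round are all assumptions made for illustration; the abstract does not say how the tree is actually cut, nor which rounds would go to which kind of core.

```python
from typing import List, Tuple

Round = List[Tuple[int, int]]  # (sender rank, receiver rank) pairs of one round


def binomial_bcast_rounds(n_ranks: int) -> List[Round]:
    """Schedule of a binomial-tree broadcast from rank 0.

    In the round with distance d, every rank below d already holds the
    data and sends it to rank + d, if that rank exists.
    """
    rounds: List[Round] = []
    distance = 1
    while distance < n_ranks:
        pairs = [(src, src + distance)
                 for src in range(distance)
                 if src + distance < n_ranks]
        rounds.append(pairs)
        distance *= 2
    return rounds


def split_rounds(rounds: List[Round],
                 n_first_rounds: int) -> Tuple[List[Round], List[Round]]:
    """Hypothetical cut: the first n_first_rounds rounds form one part,
    the remaining rounds form the other part."""
    if not 0 <= n_first_rounds <= len(rounds):
        raise ValueError("cutoff outside the number of rounds")
    return rounds[:n_first_rounds], rounds[n_first_rounds:]


if __name__ == "__main__":
    schedule = binomial_bcast_rounds(8)
    first_part, second_part = split_rounds(schedule, 1)
    print("first part :", first_part)
    print("second part:", second_part)
```

In such a sketch, one part would be assigned to application cores and the other to communication cores, and the cutoff would be the parameter to tune; the report's model is the authoritative source for how that choice is made.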

Cited literature : 16 references

https://hal.inria.fr/hal-01741787
Contributor : Alexandre Denis
Submitted on : Monday, March 26, 2018 - 1:25:01 PM
Last modification on : Thursday, May 16, 2019 - 6:46:13 PM

File

RR-9160.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01741787, version 2

Citation

Alexandre Denis, Julien Jaeger, Emmanuel Jeannot, Marc Pérache, Hugo Taboada. Dynamic Placement of Progress Thread for Overlapping MPI Non-Blocking Collectives on Manycore Processor. [Research Report] RR-9160, Inria Bordeaux Sud-Ouest. 2018, pp.1-12. ⟨hal-01741787v2⟩
