Dynamic Placement of Progress Thread for Overlapping MPI Non-Blocking Collectives on Manycore Processor

Alexandre Denis; Julien Jaeger; Emmanuel Jeannot; Marc Pérache; Hugo Taboada

Report (Research Report), Year: 2018

French title: Recouvrement des collectives MPI non-bloquantes sur processeur manycore

Alexandre Denis

Role: Author
PersonId : 103
IdHAL : adenis
ORCID : 0000-0001-8606-4344
IdRef : 225733218

Topology-Aware System-Scale Data Management for High-Performance Computing

Laboratoire Bordelais de Recherche en Informatique

Julien Jaeger

Role: Author

DAM Île-de-France

Emmanuel Jeannot

Role: Author
PersonId : 15678
IdHAL : emmanuel-jeannot
ORCID : 0000-0002-3956-2997
IdRef : 084595108

Topology-Aware System-Scale Data Management for High-Performance Computing

Laboratoire Bordelais de Recherche en Informatique

Marc Pérache

Role: Author

DAM Île-de-France

Hugo Taboada

Role: Author
PersonId : 987793

Topology-Aware System-Scale Data Management for High-Performance Computing

Laboratoire Bordelais de Recherche en Informatique

DAM Île-de-France

Abstract

To amortize the cost of MPI collective operations, non-blocking collectives have been proposed so that communication can be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. To address these issues, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores and the long but narrow part of the tree on one or several communication cores, which yields a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented the algorithm in the MPC framework, a thread-based MPI implementation. We ran benchmarks on manycore processors such as the KNL and Skylake and obtained good results for both performance and overlap.

French abstract, translated: Non-blocking MPI collectives were proposed to overlap communication with computation and thereby amortize its cost. However, these operations consume more CPU time than point-to-point operations. Using a single CPU dedicated to progress threads is therefore inefficient and makes communications slow. On the other hand, if communications run on the application cores, no overlap is obtained. To address this problem, we propose an algorithm for tree-based collective operations that splits the communication tree between the application cores and the cores dedicated to communication, in order to obtain a trade-off between overlap ratio and overall performance. We propose a model to study and predict its behavior, and we implemented it in the MPC framework. We obtained good results when testing our approach on manycore processors such as the KNL and Skylake.
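The tree split described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm or the MPC implementation: the function names, the complete binary tree shape, and the `split_depth` parameter are all hypothetical, and the mapping of the wide bottom levels to application cores and the narrow top levels to communication cores is one reading of "short but heavy" versus "long but narrow". The sketch only counts how many levels and nodes fall on each side of a chosen split.

```python
def level_sizes(num_ranks):
    """Return the number of nodes at each depth of a complete binary tree
    with num_ranks nodes (root at depth 0, levels filled top to bottom)."""
    sizes = []
    depth = 0
    remaining = num_ranks
    while remaining > 0:
        width = min(2 ** depth, remaining)
        sizes.append(width)
        remaining -= width
        depth += 1
    return sizes


def split_tree(num_ranks, split_depth):
    """Partition the tree levels at a hypothetical split depth.

    Assumption for illustration only: levels at depth >= split_depth
    (the wide bottom of the tree) go to application cores, and the
    levels above (the narrow top) go to communication cores.
    """
    sizes = level_sizes(num_ranks)
    comm = sizes[:split_depth]   # narrow part: few nodes per level
    app = sizes[split_depth:]    # wide part: many nodes per level
    return {
        "comm_levels": len(comm),
        "comm_nodes": sum(comm),
        "app_levels": len(app),
        "app_nodes": sum(app),
    }


if __name__ == "__main__":
    # 63 ranks form a full binary tree with 6 levels.
    print(level_sizes(63))
    print(split_tree(63, 4))
```

With 63 ranks and a split at depth 4, the bottom two levels hold 48 of the 63 nodes (few steps, much work), while the top four levels hold only 15 nodes (more steps, little work per step). Moving the split depth up or down is the kind of parameter a performance model would tune.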

Keywords

non-blocking collectives; communication/computation overlap; MPI

French keywords, translated: non-blocking collectives; overlap; progress thread

Domains

Networks and telecommunications [cs.NI]

Main file

RR-9160.pdf (1.33 MB)

Origin: Files produced by the author(s)

Contributor: Alexandre Denis

https://inria.hal.science/hal-01741787

Submitted on: Monday, March 26, 2018, 13:25:01

Last modified on: Wednesday, April 3, 2024, 11:24:09

Dates and versions

hal-01741787, version 1 (23-03-2018)

hal-01741787, version 2 (26-03-2018)

Identifiers

HAL Id: hal-01741787, version 2

Cite

Alexandre Denis, Julien Jaeger, Emmanuel Jeannot, Marc Pérache, Hugo Taboada. Dynamic Placement of Progress Thread for Overlapping MPI Non-Blocking Collectives on Manycore Processor. [Research Report] RR-9160, Inria Bordeaux Sud-Ouest. 2018, pp.1-12. ⟨hal-01741787v2⟩

Collections

CEA, CNRS, INRIA, INRIA-RRRT, DAM, INRIA2, LARA
