Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

Andi Drebes; Antoniu Pop; Karine Heydemann; Albert Cohen; Nathalie Drach

doi:10.1145/2967938.2967946

Conference paper. Year: 2016

Andi Drebes

Role: Author
PersonId: 997185

School of Computer Science [Manchester]

Antoniu Pop

Role: Author
PersonId: 997186

School of Computer Science [Manchester]

Karine Heydemann

Role: Author
PersonId: 8179
IdHAL: karine-heydemann
IdRef: 082986762

Architecture et Logiciels pour Systèmes Embarqués sur Puce

Albert Cohen

Role: Author
PersonId: 6894
IdHAL: acohen
ORCID: 0000-0002-8866-5343
IdRef: 067155898

Parallélisme de Kahn Synchrone

Nathalie Drach

Role: Author
PersonId: 997187

Architecture et Logiciels pour Systèmes Embarqués sur Puce

Abstract

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
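To make the two placement ideas in the abstract concrete, here is a minimal sketch. It is not the authors' implementation: the `Buffer` and `Task` types and the functions `pick_node`, `place_outputs`, and `local_fraction` are hypothetical names introduced only for illustration. It assumes that each task-private buffer carries the NUMA node that holds it, that a task is placed on the node holding the most bytes of its input data (a best-effort choice), and that output buffers are then allocated on the node where the task runs.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Buffer:
    """Hypothetical descriptor of a task-private buffer."""
    size_bytes: int
    node: Optional[int] = None  # NUMA node holding the data; None until placed


@dataclass
class Task:
    """Hypothetical task with explicit input and output buffers."""
    name: str
    inputs: List[Buffer] = field(default_factory=list)
    outputs: List[Buffer] = field(default_factory=list)


def pick_node(task: Task, default_node: int = 0) -> int:
    """Best-effort task placement: choose the node holding most input bytes."""
    bytes_per_node = defaultdict(int)
    for buf in task.inputs:
        if buf.node is not None:
            bytes_per_node[buf.node] += buf.size_bytes
    if not bytes_per_node:
        return default_node  # no placed inputs: fall back to an assumed default
    # Iterating in sorted order makes ties resolve to the lowest node id.
    return max(sorted(bytes_per_node), key=lambda n: bytes_per_node[n])


def place_outputs(task: Task, node: int) -> None:
    """Data placement: allocate every output buffer on the executing node."""
    for buf in task.outputs:
        buf.node = node


def local_fraction(task: Task, node: int) -> float:
    """Fraction of the task's input and output bytes that reside on `node`."""
    buffers = task.inputs + task.outputs
    total = sum(b.size_bytes for b in buffers)
    local = sum(b.size_bytes for b in buffers if b.node == node)
    return local / total if total else 1.0


# Usage: two producers wrote their outputs on nodes 0 and 1; the consumer
# reads both and is placed where most of its input bytes already live.
big = Buffer(size_bytes=4096, node=0)
small = Buffer(size_bytes=1024, node=1)
consumer = Task("consumer", inputs=[big, small], outputs=[Buffer(size_bytes=2048)])

chosen = pick_node(consumer)
place_outputs(consumer, chosen)
print(chosen, local_fraction(consumer, chosen))
```

In this sketch, output accesses are always local by construction, while input locality depends on where the producers ran, which mirrors the distinction the abstract draws between a guarantee for outputs and a best-effort heuristic for inputs.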

Keywords

Task-parallel programming; NUMA; Scheduling; Memory allocation; Data-flow programming

Domains

Programming Languages [cs.PL]

Main file

paper.pdf (1.61 MB)

Origin: Publisher files allowed on an open archive

Contributor: Albert Cohen

https://inria.hal.science/hal-01425743

Submitted on: Sunday, January 29, 2017, 17:31:21

Last modified on: Friday, April 19, 2024, 16:18:57

Long-term archiving on: Sunday, April 30, 2017, 12:25:16

Dates and versions

hal-01425743, version 1 (29-01-2017)

Identifiers

HAL Id: hal-01425743, version 1
DOI: 10.1145/2967938.2967946

Cite

Andi Drebes, Antoniu Pop, Karine Heydemann, Albert Cohen, Nathalie Drach. Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management. PACT'16 - ACM/IEEE Conference on Parallel Architectures and Compilation Techniques, Sep 2016, Haifa, Israel. pp. 125-137, ⟨10.1145/2967938.2967946⟩. ⟨hal-01425743⟩

Collections

ENS-PARIS, UPMC, CNRS, INRIA, LIP6, INRIA2, PSL, SORBONNE-UNIVERSITE, SU-SCIENCES

797 views

616 downloads
