Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training

Jiali Li; Bogdan Nicolae; Justin Wozniak; George Bosilca

doi:10.1109/MLHPC49564.2019.00006

Conference paper, Year: 2019


Jiali Li

Role: Author

Innovative Computing Laboratory [Knoxville]

Bogdan Nicolae

Role: Author
PersonId: 21945
IdHAL: bnicolae
ORCID: 0000-0002-0661-7509

Argonne National Laboratory [Lemont]

Justin Wozniak

Role: Author

Argonne National Laboratory [Lemont]

George Bosilca

Role: Author

Innovative Computing Laboratory [Knoxville]

Abstract

In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With the increasing complexity of learning models and growing amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Although high performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as the number of nodes, number of workers, threads per node, and batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction, and weight updates mitigates the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.
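The synchronization pattern the abstract refers to can be illustrated with a small simulation. The sketch below is not Horovod and uses no MPI: it is a single-process stand-in, with hypothetical names (`local_gradient`, `all_reduce_mean`, `train`) and a toy one-parameter model chosen purely for illustration. It shows only the basic idea that each worker computes a gradient on its own data shard, an all-reduce gives every worker the same averaged gradient, and all replicas then apply an identical weight update. It does not model the pipelining of back-propagation with communication or the straggler effects studied in the paper.

```python
# Minimal simulation of synchronous data-parallel SGD (no MPI, no Horovod).
# All names here are hypothetical and chosen for illustration only.

def local_gradient(w, shard):
    """Mean-squared-error gradient of y ~ w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def train(shards, steps=200, lr=0.01):
    weights = [0.0] * len(shards)  # one model replica per simulated worker
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Synchronization point: no worker proceeds until all have contributed.
        synced = all_reduce_mean(grads)
        # Every replica applies the same update, so replicas stay identical.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights

# Toy data generated from y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
final_weights = train(shards)
```

Because every step waits on the all-reduce, the slowest worker sets the pace of the whole step in a real deployment, which is why stragglers matter for this class of training.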

Keywords

deep learning; data-parallel training; behavior analysis

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

Understanding_Scalability_and_Fine_Grain_Parallelism_of_Synchronous_Data_Parallel_Training.pdf (801.99 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-02570148

Submitted on: Monday, May 11, 2020, 18:35:55

Last modified on: Wednesday, May 13, 2020, 09:50:07

Dates and versions

hal-02570148, version 1 (11-05-2020)

Identifiers

HAL Id: hal-02570148, version 1
DOI: 10.1109/MLHPC49564.2019.00006

Cite

Jiali Li, Bogdan Nicolae, Justin Wozniak, George Bosilca. Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training. 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), Nov 2019, Denver, United States. pp.1-8, ⟨10.1109/MLHPC49564.2019.00006⟩. ⟨hal-02570148⟩
