Conference Paper, Year: 2023

Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering

Abstract

Despite significant advances, training deep learning models remains a time-consuming and resource-intensive task. One of the key challenges in this context is the ingestion of the training data, which involves non-trivial overheads: reading the training data from a remote repository, applying augmentations and transformations, shuffling the training samples, and assembling them into mini-batches. Despite the introduction of abstractions such as data pipelines, which aim to hide these overheads by performing them asynchronously, data ingestion is often slower than the training itself, causing a delay at each training iteration. This problem is further amplified when training multiple deep learning models simultaneously on powerful compute nodes that feature multiple GPUs. In this case, the training data is often reused across different training instances (e.g., in multi-model or ensemble training) or even within the same training instance (e.g., data-parallel training). However, transparent caching solutions (e.g., OS-level POSIX caching) are not suitable to directly mitigate the competition between training instances that reuse the same training data. In this paper, we study the problem of minimizing the makespan of two training instances that reuse the same training data. The makespan is subject to a trade-off: if the training instances start at the same time, competition for I/O bandwidth slows down the data pipelines and increases the makespan; if one training instance is staggered, competition is reduced, but the delayed start can again increase the makespan. We optimize this trade-off by proposing a performance model capable of predicting the makespan as a function of the staggering between the training instances, which can be used to find the optimal staggering that triggers just enough competition to make optimal use of transparent caching and thereby minimize the makespan. Experiments with different combinations of learning models that share the same training data demonstrate that (1) staggering is important to minimize the makespan, and (2) our performance model is accurate and can predict the optimal staggering in advance, at the cost of a calibration overhead.
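To make the staggering trade-off concrete, below is a minimal, illustrative Python sketch, not the performance model from the paper: it assumes each training instance can be summarized by a calibrated cache-cold epoch time, a cache-warm epoch time, and a multiplicative slowdown under I/O contention. All parameter names, values, and the coarse two-regime structure are hypothetical simplifications.

```python
def predicted_makespan(stagger, epochs, cold_epoch, warm_epoch, slowdown):
    """Toy makespan predictor for two identical training instances that reuse
    the same training data (a coarse illustration, not the paper's model).

    cold_epoch: epoch time when samples are fetched from remote storage (s)
    warm_epoch: epoch time when samples are served from the OS page cache (s)
    slowdown:   multiplicative penalty while both instances compete for I/O
    """
    if stagger < cold_epoch:
        # Cold phases overlap: both instances share remote I/O bandwidth,
        # so both first epochs are stretched by the contention penalty.
        a_finish = slowdown * cold_epoch + (epochs - 1) * warm_epoch
        b_finish = stagger + slowdown * cold_epoch + (epochs - 1) * warm_epoch
    else:
        # The second instance starts after the dataset has been staged into
        # the page cache by the first one: every one of its epochs is warm.
        a_finish = cold_epoch + (epochs - 1) * warm_epoch
        b_finish = stagger + epochs * warm_epoch
    return max(a_finish, b_finish)


# Hypothetical calibration values: 100 s cache-cold epoch, 40 s cache-warm
# epoch, 1.5x slowdown under contention, 5 epochs per training instance.
staggers = [float(s) for s in range(0, 201)]
best = min(staggers, key=lambda s: predicted_makespan(s, 5, 100.0, 40.0, 1.5))
print(best, predicted_makespan(best, 5, 100.0, 40.0, 1.5))  # 100.0 300.0
```

With these hypothetical numbers, the minimum lands at the boundary between the two regimes: the second instance starts right after the dataset has been staged into the page cache, mirroring the intuition above that the optimal staggering triggers just enough competition to make the most of transparent caching. The paper's actual model is calibrated from measured pipeline and training behavior rather than assumed constants.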
Main file: HiPC23_DL_cache_aware_scheduling_for_optimized_DL_data_IO.pdf (523.33 KB)
Origin: Files produced by the author(s)
License: CC BY - Attribution

Dates and versions

hal-04343672, version 1 (14-12-2023)

License

Attribution (CC BY)

Identifiers

  • HAL Id: hal-04343672, version 1

Cite

Kevin Assogba, M. Mustafa Rafique, Bogdan Nicolae. Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering. HiPC'23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2023, Goa, India. ⟨hal-04343672⟩