
Dynamic Backup Workers for Parallel Machine Learning

Chuan Xu¹, Giovanni Neglia¹, Nicola Sebastianelli¹
¹ NEO - Network Engineering and Operations, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract: The most popular framework for the parallel training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of n workers and a stateful PS, which waits for the responses of all workers before proceeding to the next iteration. Transient computation slowdowns or transmission delays can intolerably lengthen each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest n − b updates before generating the new parameters. The slowest b workers are called backup workers. The optimal number b of backup workers depends on the cluster configuration and workload, but also (as we show in this paper) on the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration. Our experiments show that DBW 1) removes the need to tune b through preliminary time-consuming experiments, and 2) makes the training up to a factor of 3 faster than the optimal static configuration.
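To make the waiting rule concrete, the following minimal Python sketch simulates a synchronous PS that aggregates only the fastest n − b gradients at each iteration. This is an illustrative toy, not the paper's implementation of DBW: the function name ps_step, the quadratic toy objective, and the exponential delay model are assumptions introduced here for illustration.

    import random

    def ps_step(params, n, b, grad_fn, delay_fn):
        """One PS iteration: aggregate the n - b fastest gradients."""
        # Simulate all n workers; each returns (completion_time, gradient).
        results = sorted((delay_fn(), grad_fn(params)) for _ in range(n))
        fastest = results[: n - b]      # discard the b slowest (backup) workers
        iter_time = fastest[-1][0]      # PS proceeds when the (n - b)-th reply arrives
        avg_grad = sum(g for _, g in fastest) / (n - b)
        return params - 0.1 * avg_grad, iter_time

    # Toy usage: scalar parameter, quadratic loss f(x) = x^2 (noisy gradient 2x),
    # exponential delays to mimic transient slowdowns.
    x, total_time = 5.0, 0.0
    for _ in range(50):
        x, t = ps_step(x, n=10, b=2,
                       grad_fn=lambda p: 2 * p + random.gauss(0, 0.5),
                       delay_fn=lambda: random.expovariate(1.0))
        total_time += t
    print(f"x = {x:.4f} after {total_time:.1f} simulated time units")

With a static b, the sketch exposes the tradeoff the abstract alludes to: a larger b shortens each iteration (the PS waits for fewer replies) but averages fewer gradients, so updates are noisier. DBW would amount to choosing b afresh at each call rather than fixing it in advance.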

https://hal.inria.fr/hal-03044393
Contributor: Chuan Xu
Submitted on: Monday, December 7, 2020 - 5:38:07 PM
Last modification on: Wednesday, January 6, 2021 - 9:16:53 AM


Identifiers

  • HAL Id: hal-03044393, version 1

Citation

Chuan Xu, Giovanni Neglia, Nicola Sebastianelli. Dynamic Backup Workers for Parallel Machine Learning. IFIP Networking 2020, Jun 2020, Paris / Online, France. ⟨hal-03044393⟩
