Conference papers

Dynamic Backup Workers for Parallel Machine Learning

Chuan Xu (1), Giovanni Neglia (1), Nicola Sebastianelli (1)
(1) NEO - Network Engineering and Operations
CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : The most popular framework for parallel training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of n workers and a stateful PS, which waits for the result of every worker's computation before proceeding to the next iteration. Transient computation slowdowns or transmission delays can intolerably lengthen the time of each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest n − b updates before generating the new parameters. The slowest b workers are called backup workers. The optimal number b of backup workers depends on the cluster configuration and workload, but also (as we show in this paper) on the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration. Our experiments show that DBW 1) removes the need to tune b through time-consuming preliminary experiments, and 2) makes the training up to a factor of 3 faster than the optimal static configuration.
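The waiting rule described in the abstract can be illustrated with a minimal Python sketch. It is not the DBW algorithm and is not taken from the paper: the function names, the slowdown model (a worker is occasionally 5 times slower), and all parameter values are assumptions chosen only for illustration. The sketch models only how long the PS waits per iteration for a fixed b; it does not model how using fewer updates affects convergence.

```python
import random


def iteration_time(durations, b):
    """Time the PS waits in one iteration if it needs only the fastest n - b updates.

    `durations` holds the (hypothetical) completion time of each of the n workers.
    """
    n = len(durations)
    if not 0 <= b < n:
        raise ValueError("b must satisfy 0 <= b < n")
    # The PS proceeds as soon as the (n - b)-th fastest update has arrived.
    return sorted(durations)[n - b - 1]


def average_iteration_time(n=8, iterations=1000, slowdown_prob=0.1, seed=0):
    """Average wait per iteration for every static b, under an assumed slowdown model."""
    rng = random.Random(seed)
    totals = {b: 0.0 for b in range(n)}
    for _ in range(iterations):
        # Assumption: each worker takes about 1 time unit, and with
        # probability `slowdown_prob` suffers a transient 5x slowdown.
        durations = [
            rng.uniform(0.9, 1.1) * (5.0 if rng.random() < slowdown_prob else 1.0)
            for _ in range(n)
        ]
        for b in totals:
            totals[b] += iteration_time(durations, b)
    return {b: total / iterations for b, total in totals.items()}


if __name__ == "__main__":
    for b, avg in average_iteration_time().items():
        print(f"b = {b}: average iteration time = {avg:.2f}")
```

With b = 0 the PS waits for the slowest worker, which is the fully synchronous case; larger b shortens each iteration but aggregates fewer updates, which is the trade-off the abstract refers to.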
Contributor : Chuan Xu
Submitted on : Monday, December 7, 2020 - 5:38:07 PM
Last modification on : Wednesday, November 3, 2021 - 5:04:08 AM
Long-term archiving on : Monday, March 8, 2021 - 7:39:19 PM


Files produced by the author(s)


  • HAL Id : hal-03044393, version 1



Chuan Xu, Giovanni Neglia, Nicola Sebastianelli. Dynamic Backup Workers for Parallel Machine Learning. IFIP Networking 2020, Jun 2020, Paris / Online, France. ⟨hal-03044393⟩


