Conference papers

Stochastic Gradient Descent: Going As Fast As Possible But Not Faster

Alice Sebag 1 Marc Schoenauer 2, 3 Michèle Sebag 2, 3
2 TAU - TAckling the Underspecified
Inria Saclay - Ile de France, LRI - Laboratoire de Recherche en Informatique
Abstract : When applied to training deep neural networks, stochastic gradient descent (SGD) often incurs steady progression phases, interrupted by catastrophic episodes in which loss and gradient norm explode. A possible mitigation of such events is to slow down the learning process. This paper presents a novel approach, called SALERA, to control the SGD learning rate, that uses two statistical tests. The first one, aimed at fast learning, compares the momentum of the normalized gradient vectors to that of random unit vectors and accordingly gracefully increases or decreases the learning rate. The second one is a change point detection test, aimed at the detection of catastrophic learning episodes; upon its triggering the learning rate is instantly halved. Experiments on standard benchmarks show that SALERA performs well in practice, and compares favorably to the state of the art.

Machine Learning (ML) algorithms require efficient optimization techniques, whether to solve convex problems (e.g., for SVMs), or non-convex ones (e.g., for Neural Networks). As the data size and the model dimensionality increase, mainstream convex optimization methods are adversely affected. Overall, Stochastic Gradient Descent (SGD) is increasingly adopted. Within the SGD framework, one of the main issues is to know how to control the learning rate. The adequate speed depends both on the current state of the system (the weight vector) and the current mini-batch. Often, the eventual convergence of SGD is ensured by decaying the learning rate as O(1/t) [23, 6] or O(1/√t) [29]. While learning rate decay effectively prevents catastrophic events (sudden rocketing of the training loss and gradient norm), many and diverse approaches have been designed to achieve quicker convergence through learning rate adaptation [1, 7, 24, 16, 26, 2] (more in Section 1).
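The first test in the abstract, comparing the momentum of normalized gradients with that of random unit vectors, can be illustrated with a small sketch. This is not the authors' implementation: the exponential moving average form, the function names, the multiplicative `factor`, and the simple threshold rule are all assumptions made for illustration. The only derived fact is the reference value: for independent random unit vectors averaged with weight `alpha` from a zero start, cross terms vanish in expectation, so the expected squared norm after `t` steps is (1 - alpha)(1 - alpha^(2t)) / (1 + alpha).

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normalize(v):
    """Scale a gradient to unit norm (left unchanged if it is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def ema_update(m, u, alpha):
    """Exponential moving average: m <- alpha * m + (1 - alpha) * u."""
    return [alpha * mi + (1.0 - alpha) * ui for mi, ui in zip(m, u)]


def expected_random_sq_norm(t, alpha):
    """E||r_t||^2 for the average of t independent random unit vectors, r_0 = 0.

    Equals (1 - alpha)^2 * sum_{k < t} alpha^(2k), written in closed form.
    """
    return (1.0 - alpha) * (1.0 - alpha ** (2 * t)) / (1.0 + alpha)


def adapt_learning_rate(lr, m, t, alpha, factor=1.05):
    """Hypothetical rule: speed up when gradients agree more than random would."""
    sq_norm = sum(x * x for x in m)
    if sq_norm > expected_random_sq_norm(t, alpha):
        return lr * factor
    return lr / factor


if __name__ == "__main__":
    alpha, lr = 0.9, 0.1
    m = [0.0, 0.0, 0.0]
    for t in range(1, 11):
        # Perfectly aligned gradients: the momentum grows faster than random.
        m = ema_update(m, normalize([2.0, 0.0, 0.0]), alpha)
        lr = adapt_learning_rate(lr, m, t, alpha)
    print(lr)
```

With aligned gradients the momentum norm exceeds the random reference and the rate grows; with gradients that alternate in sign it falls below the reference and the rate shrinks.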
This paper proposes a novel approach to adaptive SGD, called SALERA (Safe Agnostic LEarning Rate Adaptation), based on the conjecture that, if catastrophes are well taken care of, the learning process can speed up whenever successive gradient directions show more correlation than random. The frequent advent of catastrophic episodes [11, Chapter 8], [3] raises the question of how to best mitigate their impact. Framing catastrophic episodes as random events, we adopt a purely curative strategy: detecting and instantly curing catastrophic episodes. Formally, a sequential cumulative sum change detection test, the Page-Hinkley (PH) test [20, 14], is adapted and used to monitor the learning curve reporting the minibatch losses. If a change in the learning curve is detected, the system undergoes an instant cure by halving the learning rate and backtracking to its former state.

Once the risk of catastrophic episodes is well addressed, the learning rate can be adapted in a more agile manner: the ALERA (Agnostic LEarning Rate Adaptation) process increases (resp. decreases) the learning rate whenever the correlation among successive gradient directions is higher (resp. lower) than random, by comparing the actual gradient momentum and the agnostic momentum built from random unit vectors. The contribution of the paper is twofold. First, it proposes an original and efficient way to control learning dynamics (Section 2.1). Secondly, it opens a new approach for handling catastrophic events and salvaging a […]

OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning (NIPS 2017).
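The curative strategy described in the text above, monitoring minibatch losses with a Page-Hinkley test and halving the learning rate on detection, can be sketched as follows. This shows the textbook form of the test for an upward shift in the mean, not the adapted variant the paper uses, whose details are not given in this record. The class and function names and the `delta` and `threshold` defaults are hypothetical, and the backtracking step is only indicated by a comment.

```python
class PageHinkley:
    """Textbook Page-Hinkley test for an upward shift in a stream's mean."""

    def __init__(self, delta=0.005, threshold=1.0):
        self.delta = delta          # tolerated drift (hypothetical default)
        self.threshold = threshold  # alarm level (hypothetical default)
        self.reset()

    def reset(self):
        self.n = 0          # number of observations seen
        self.mean = 0.0     # running mean of the observations
        self.cum = 0.0      # cumulative deviation from the running mean
        self.cum_min = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True when a change is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def monitor(losses, lr, detector):
    """Halve lr each time the detector fires on the minibatch loss stream."""
    alarms = []
    for step, loss in enumerate(losses):
        if detector.update(loss):
            lr *= 0.5
            detector.reset()
            alarms.append(step)
            # A full training loop would also restore the previous weights here.
    return lr, alarms


if __name__ == "__main__":
    # Fifty stable losses followed by a sudden jump.
    stream = [1.0] * 50 + [5.0] * 5
    final_lr, alarm_steps = monitor(stream, 0.1, PageHinkley())
    print(final_lr, alarm_steps)
```

Resetting the detector after each alarm is a design choice made here so that a single jump triggers a single halving; other policies are possible.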
Document type :
Conference papers

Cited literature [29 references]
Contributor : Marc Schoenauer
Submitted on : Sunday, February 4, 2018 - 4:11:27 PM
Last modification on : Monday, November 16, 2020 - 8:38:05 AM
Long-term archiving on : Thursday, May 3, 2018 - 2:22:40 PM


Files produced by the author(s)


  • HAL Id : hal-01700460, version 1



Alice Sebag, Marc Schoenauer, Michèle Sebag. Stochastic Gradient Descent: Going As Fast As Possible But Not Faster. OPTML 2017 : 10th NIPS Workshop on Optimization for Machine Learning, Alekh Agarwal (MSR, US), Ben Recht (UC Berkeley, US), Sashank J. Reddi (Google, US), Suvrit Sra (MIT, US), Dec 2017, Los Angeles, United States. pp.1-8. ⟨hal-01700460⟩


