Stochastic Gradient Descent: Going As Fast As Possible But Not Faster

Alice Sebag 1 Marc Schoenauer 2, 3 Michèle Sebag 2, 3
2 TAU - TAckling the Underspecified
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: When applied to training deep neural networks, stochastic gradient descent (SGD) often incurs steady progression phases, interrupted by catastrophic episodes in which the loss and gradient norm explode. A possible mitigation of such events is to slow down the learning process. This paper presents a novel approach, called SALERA, to control the SGD learning rate using two statistical tests. The first, aimed at fast learning, compares the momentum of the normalized gradient vectors to that of random unit vectors, and accordingly gracefully increases or decreases the learning rate. The second is a change-point detection test, aimed at detecting catastrophic learning episodes; upon its triggering, the learning rate is instantly halved. Experiments on standard benchmarks show that SALERA performs well in practice and compares favorably to the state of the art.

Machine Learning (ML) algorithms require efficient optimization techniques, whether to solve convex problems (e.g., for SVMs) or non-convex ones (e.g., for neural networks). As the data size and the model dimensionality increase, mainstream convex optimization methods are adversely affected; overall, Stochastic Gradient Descent (SGD) is increasingly adopted. Within the SGD framework, one of the main issues is how to control the learning rate. The adequate speed depends both on the current state of the system (the weight vector) and on the current mini-batch. Often, the eventual convergence of SGD is ensured by decaying the learning rate as O(1/t) [23, 6] or O(1/√t) [29]. While learning rate decay effectively prevents catastrophic events (sudden rocketing of the training loss and gradient norm), many and diverse approaches have been designed to achieve quicker convergence through learning rate adaptation [1, 7, 24, 16, 26, 2] (more in Section 1).
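The momentum-comparison test described in the abstract can be sketched as follows. The exponential decay `beta`, the vector dimension, and the helper names are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Normalize a vector to unit length (returned unchanged if zero)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def momentum_norms(grads, beta=0.9):
    """Exponential momentum of normalized gradients vs. an 'agnostic'
    momentum built identically from random unit vectors.  A longer
    gradient momentum means successive gradient directions are more
    correlated than chance."""
    m_grad = np.zeros_like(grads[0])
    m_rand = np.zeros_like(grads[0])
    for g in grads:
        m_grad = beta * m_grad + (1 - beta) * unit(g)
        m_rand = beta * m_rand + (1 - beta) * unit(rng.standard_normal(g.shape))
    return np.linalg.norm(m_grad), np.linalg.norm(m_rand)

# Correlated gradients (sharing a rough common direction) yield a longer
# momentum than random unit vectors, signalling that the learning rate
# could be gracefully increased; the reverse signals a decrease.
dim = 100
grads = [np.ones(dim) + 0.1 * rng.standard_normal(dim) for _ in range(50)]
g_norm, r_norm = momentum_norms(grads)
```

Since the agnostic momentum is built from the same update rule, it gives a natural baseline for "no correlation" without any problem-specific tuning.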
This paper proposes a novel approach to adaptive SGD, called SALERA (Safe Agnostic LEarning Rate Adaptation), based on the conjecture that, if catastrophes are well taken care of, the learning process can speed up whenever successive gradient directions show more correlation than random. The frequent advent of catastrophic episodes [11, Chapter 8], [3] raises the question of how best to mitigate their impact. Framing catastrophic episodes as random events, we adopt a purely curative strategy: detecting and instantly curing catastrophic episodes. Formally, a sequential cumulative-sum change detection test, the Page-Hinkley (PH) test [20, 14], is adapted and used to monitor the learning curve reporting the minibatch losses. If a change in the learning curve is detected, the system undergoes an instant cure by halving the learning rate and backtracking to its former state. Once the risk of catastrophic episodes is well addressed, the learning rate can be adapted in a more agile manner: the ALERA (Agnostic LEarning Rate Adaptation) process increases (resp. decreases) the learning rate whenever the correlation among successive gradient directions is higher (resp. lower) than random, by comparing the actual gradient momentum and the agnostic momentum built from random unit vectors. The contribution of the paper is twofold. First, it proposes an original and efficient way to control learning dynamics (Section 2.1). Secondly, it opens a new approach for handling catastrophic events and salvaging a learning run after a catastrophic episode.
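The curative side can be sketched with a minimal Page-Hinkley detector monitoring the minibatch losses. The drift tolerance `delta`, the alarm threshold `lam`, and the class name are illustrative assumptions; the paper adapts the test rather than using this textbook form verbatim:

```python
class PageHinkley:
    """Minimal Page-Hinkley change-detection test: raises an alarm when
    the cumulative positive deviation of a monitored signal (here, the
    minibatch loss) from its running mean exceeds a threshold."""

    def __init__(self, delta=0.05, lam=5.0):
        self.delta = delta      # tolerated drift per step
        self.lam = lam          # alarm threshold
        self.n = 0
        self.mean = 0.0         # running mean of observed losses
        self.cum = 0.0          # cumulative deviation from the mean
        self.cum_min = 0.0      # running minimum of the cumulative deviation

    def update(self, loss):
        self.n += 1
        self.mean += (loss - self.mean) / self.n
        self.cum += loss - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam   # True => change detected

# A steady loss stream followed by a sudden explosion: the detector stays
# silent during the stable phase and fires on the catastrophe, at which
# point SALERA halves the learning rate and restores the former state.
ph = PageHinkley()
losses = [1.0] * 50 + [50.0] * 5
alarms = [ph.update(l) for l in losses]
```

Because the statistic is the gap between the cumulative deviation and its running minimum, gradual improvement of the loss never triggers the alarm; only an abrupt, sustained rise does.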
Document type:
Conference paper
OPTML 2017 : 10th NIPS Workshop on Optimization for Machine Learning, Dec 2017, Los Angeles, United States. pp.1-8, 2017, 〈http://opt-ml.org〉

Cited literature: 29 references

https://hal.inria.fr/hal-01700460
Contributor: Marc Schoenauer
Submitted on: Sunday, February 4, 2018 - 16:11:27
Last modified on: Thursday, February 7, 2019 - 16:02:01
Document(s) archived on: Thursday, May 3, 2018 - 14:22:40

File

OPT2017_paper_14.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01700460, version 1

Citation

Alice Sebag, Marc Schoenauer, Michèle Sebag. Stochastic Gradient Descent: Going As Fast As Possible But Not Faster. OPTML 2017 : 10th NIPS Workshop on Optimization for Machine Learning, Dec 2017, Los Angeles, United States. pp.1-8, 2017, 〈http://opt-ml.org〉. 〈hal-01700460〉

Metrics

Record views: 473
File downloads: 225