Multilevel Stochastic Gradient Methods for Nested Composition Optimization

2019 · Vol 29 (1) · pp. 616–659
Author(s): Shuoguang Yang, Mengdi Wang, Ethan X. Fang

2020 · Vol 8
Author(s): Hoonyoung Jeong, Alexander Y. Sun, Jonghyeon Jeon, Baehyun Min, Daein Jeong

2014 · Vol 34 (3) · pp. 373–393
Author(s): Nataša Krejić, Nataša Krklec Jerinkić

2020 · Vol 363 · pp. 112909
Author(s): André Gustavo Carlon, Ben Mansour Dia, Luis Espath, Rafael Holdorf Lopez, Raúl Tempone

Author(s): Derek Driggs, Matthias J. Ehrhardt, Carola-Bibiane Schönlieb

Abstract: Variance reduction is a crucial tool for improving the slow convergence of stochastic gradient descent. Only a few variance-reduced methods, however, have yet been shown to directly benefit from Nesterov’s acceleration techniques to match the convergence rates of accelerated gradient methods. Such approaches rely on “negative momentum”, a technique for further variance reduction that is generally specific to the SVRG gradient estimator. In this work, we show for the first time that negative momentum is unnecessary for acceleration and develop a universal acceleration framework that allows all popular variance-reduced methods to achieve accelerated convergence rates. The constants appearing in these rates, including their dependence on the number of functions n, scale with the mean-squared error and bias of the gradient estimator. In a series of numerical experiments, we demonstrate that versions of SAGA, SVRG, SARAH, and SARGE using our framework significantly outperform non-accelerated versions and compare favourably with algorithms using negative momentum.
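
To make the role of the gradient estimator concrete, the following is a minimal Python sketch of the SVRG estimator that the negative-momentum approaches mentioned above are tied to. It is illustrative only and does not reproduce the paper's accelerated framework; the helper grad_i(w, i), returning the gradient of the i-th component function at w, is an assumed name.

    # Minimal sketch of one SVRG epoch (illustrative, not the authors' code).
    import numpy as np

    def svrg_epoch(w, grad_i, n, step, inner_iters, rng=None):
        """One outer SVRG epoch: take a snapshot, compute the full gradient,
        then run variance-reduced inner updates."""
        rng = rng or np.random.default_rng()
        w_snap = w.copy()
        # Full gradient at the snapshot point (computed once per epoch).
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(inner_iters):
            i = rng.integers(n)
            # Variance-reduced estimator: unbiased, and its variance shrinks
            # as the iterate w approaches the snapshot w_snap.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w = w - step * g
        return w

The accelerated framework in the paper wraps estimators of this kind (SAGA, SVRG, SARAH, SARGE); only the estimator's mean-squared error and bias enter the resulting rates.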


Author(s): Beitong Zhou, Jun Liu, Weigao Sun, Ruijuan Chen, Claire Tomlin, ...

We propose a novel technique for improving the stochastic gradient descent (SGD) method to train deep networks, which we term pbSGD. The proposed pbSGD method simply raises the stochastic gradient to a certain power elementwise during iterations and introduces only one additional parameter, namely, the power exponent (when it equals 1, pbSGD reduces to SGD). We further propose pbSGD with momentum, which we term pbSGDM. The main results of this paper present comprehensive experiments on popular deep learning models and benchmark datasets. Empirical results show that the proposed pbSGD and pbSGDM achieve faster initial training than adaptive gradient methods, generalization ability comparable to that of SGD, and improved robustness to hyper-parameter selection and vanishing gradients. pbSGD is essentially a gradient modifier via a nonlinear transformation. As such, it is orthogonal and complementary to other techniques for accelerating gradient-based optimization, such as learning rate schedules. Finally, we present a convergence rate analysis for both the pbSGD and pbSGDM methods. The theoretical convergence rates match the best known rates for SGD and SGDM on nonconvex functions.
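
As a hedged illustration of the update rule described above, the sketch below takes the elementwise power in the sign-preserving form sign(g) * |g|**gamma, which reduces to plain SGD (or SGDM, in the momentum case) when gamma equals 1. The names gamma, lr, and beta are illustrative choices, not taken from the paper.

    # Sketch of pbSGD / pbSGDM updates under the assumptions stated above.
    import numpy as np

    def pbsgd_step(w, g, lr, gamma):
        """One pbSGD step: apply the elementwise power to the stochastic gradient g."""
        return w - lr * np.sign(g) * np.abs(g) ** gamma

    def pbsgdm_step(w, g, v, lr, gamma, beta=0.9):
        """One pbSGDM step: heavy-ball momentum applied to the powered gradient."""
        v = beta * v + np.sign(g) * np.abs(g) ** gamma
        return w - lr * v, v

Because the transformation acts only on the gradient itself, it can be combined with an existing learning rate schedule without changing the rest of the training loop.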

