On Learning Rates and Schrödinger Operators

Authors: Bin Shi, Weijie Su, Michael I. Jordan

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Understanding the iterative behavior of stochastic optimization algorithms for minimizing nonconvex functions remains a crucial challenge in demystifying deep learning. In particular, it is not yet understood why certain simple techniques are remarkably effective for tuning the learning rate in stochastic gradient descent (SGD)... As a numerical illustration of this complexity, Figure 1 plots the error of SGD with a piecewise constant learning rate in the training of a neural network on the CIFAR-10 dataset.
Researcher Affiliation | Academia | Bin Shi (EMAIL), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China, and School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China. Weijie J. Su (EMAIL), Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA. Michael I. Jordan (EMAIL), Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley, CA 94720, USA.
Pseudocode | No | The paper describes algorithms such as SGD and SGLD using mathematical equations (e.g., x_{k+1} = x_k − s∇f(x_k)) and continuous-time SDEs. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
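For concreteness, the discrete update the paper writes in equation form can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the quadratic objective, step count, and the sqrt(s)-scaled noise in the SGLD variant are assumptions chosen for a self-contained toy example (the paper's own noise scaling depends on its temperature parameterization).

```python
import numpy as np

def sgd_step(x, grad, s):
    """One gradient-descent step: x_{k+1} = x_k - s * grad_f(x_k)."""
    return x - s * grad(x)

def sgld_step(x, grad, s, rng):
    """Langevin-style step: gradient step plus Gaussian noise.
    The sqrt(s) noise scale here is illustrative only."""
    return x - s * grad(x) + np.sqrt(s) * rng.standard_normal()

# Toy objective: f(x) = x^2 / 2, so grad_f(x) = x.
grad_f = lambda z: z

x = 1.0
for _ in range(100):
    x = sgd_step(x, grad_f, s=0.1)
# x contracts by a factor (1 - s) per step, approaching the minimizer 0
```

With a constant learning rate s, each step multiplies the iterate by (1 − s) on this quadratic, which is the deterministic skeleton underlying the SGD/SGLD dynamics the paper analyzes.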
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide links to any code repositories. The license information provided is for the paper itself, not for accompanying code.
Open Datasets | Yes | As a numerical illustration of this complexity, Figure 1 plots the error of SGD with a piecewise constant learning rate in the training of a neural network on the CIFAR-10 dataset. With a constant learning rate, SGD quickly reaches a plateau in terms of training error, and whenever the learning rate decreases, the plateau decreases as well, thereby yielding better optimization performance. This illustration exemplifies the idea of learning rate decay, a technique that is used in training deep neural networks (see, e.g., He et al., 2016; Bottou et al., 2018; Sordello and Su, 2019).
Dataset Splits | No | Figure 1 mentions training a neural network on CIFAR-10 and reports 'training error'. However, the paper does not specify any training/validation/test splits, their percentages, or how they were created for reproduction.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the numerical illustrations or experiments, such as GPU models, CPU types, or memory configurations. It mentions only 'Matlab2019b' in the Figure 5 caption, which is software.
Software Dependencies | No | The paper mentions 'Matlab2019b' as a tool used for generating some figures ('using the noise generator state 1-10000 in Matlab2019b' in the Figure 5 caption). However, it does not provide a comprehensive list of software dependencies with specific version numbers required to replicate the experiments or implement the described methodology.
Experiment Setup | Yes | Figure 1: Training error using SGD with mini-batch size 32 to train an 8-layer convolutional neural network on CIFAR-10 (Krizhevsky, 2009). The first 90 epochs use a learning rate of s = 0.006, the next 120 epochs use s = 0.003, and the final 190 epochs use s = 0.0005. Figure 3: The learning rate is set to either s = 0.1 or s = 0.05. ... The gradient noise is drawn from the standard normal distribution. All results are averaged over 10000 independent replications.