Continuous Time Analysis of Momentum Methods

Authors: Nikola B. Kovachki, Andrew M. Stuart

JMLR 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Through these approximation theorems, and accompanying numerical experiments, we make the following contributions to the understanding of momentum methods as often implemented within machine learning: We provide numerical experiments which illustrate the foregoing considerations, for simple linear test problems, and for the MNIST digit classification problem; in the latter case we consider SGD and thereby demonstrate that the conclusions of our theory have relevance for understanding the stochastic setting as well. To demonstrate that our analysis is indeed relevant in the stochastic setting, we train a deep autoencoder with mini-batching (stochastic) and verify that our convergence results still hold. The details of this experiment are given in section 5." |
| Researcher Affiliation | Academia | "Nikola B. Kovachki, EMAIL, Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA; Andrew M. Stuart, EMAIL, Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA" |
| Pseudocode | No | The paper describes its optimization methods and numerical schemes using mathematical equations (e.g., equations (6), (7), (9), (10), (15), (35), (36)) and detailed textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | No | The paper contains no explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | "We train a deep autoencoder, using the architecture of Hinton and Salakhutdinov (2006) on the MNIST dataset LeCun and Cortes (2010)." |
| Dataset Splits | No | "Since our work is concerned only with optimization and not generalization, we present our results only on the training set of 60,000 images and ignore the testing set." |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions "widely used deep learning libraries such as TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2017)" in a general context but does not specify the versions of any software dependencies used in its own experiments. |
| Experiment Setup | Yes | "We fix an initialization of the autoencoder following Glorot and Bengio (2010) and use it to test every optimization method. Furthermore, we fix a batch size of 200 and train for 500 epochs, not shuffling the data set during training so that each method sees the same realization of the noise. We use the mean-squared error as our loss function. We were unable to train the autoencoder using (35) with h = 1 since λ = 0.9 implies an effective learning rate of 10, for which the system blows up. Since deep neural networks are not strongly convex, there is no single optimal choice of µ; we simply set µ = 1 in our experiments." |
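The quoted "effective learning rate of 10" follows from the standard heavy-ball velocity recursion v ← λv − h∇Φ(x): under a (hypothetically) constant gradient, the velocity accumulates geometrically to −h∇Φ/(1 − λ), so h = 1 and λ = 0.9 yield an effective step size of h/(1 − λ) = 10. A minimal sketch of this accumulation, assuming the paper's equation (35) (not reproduced here) is the usual momentum update:

```python
def momentum_velocity(h, lam, g, steps=200):
    """Iterate the heavy-ball velocity update v <- lam*v - h*g
    with a constant scalar gradient g; the geometric series
    converges to the limiting velocity -h*g / (1 - lam)."""
    v = 0.0
    for _ in range(steps):
        v = lam * v - h * g
    return v

# With h = 1 and lam = 0.9 the velocity settles near -h*g/(1 - lam) = -10*g,
# i.e. each iterate effectively moves by 10x the raw gradient step.
print(momentum_velocity(1.0, 0.9, 1.0))  # ≈ -10.0
```

This is why halting divergence required reducing h rather than λ: the effective rate scales as h/(1 − λ), so at λ = 0.9 every unit of h is amplified tenfold.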