Continuous Time Analysis of Momentum Methods
Authors: Nikola B. Kovachki, Andrew M. Stuart
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through these approximation theorems, and accompanying numerical experiments, we make the following contributions to the understanding of momentum methods as often implemented within machine learning: We provide numerical experiments which illustrate the foregoing considerations, for simple linear test problems, and for the MNIST digit classification problem; in the latter case we consider SGD and thereby demonstrate that the conclusions of our theory have relevance for understanding the stochastic setting as well. To demonstrate that our analysis is indeed relevant in the stochastic setting, we train a deep autoencoder with mini-batching (stochastic) and verify that our convergence results still hold. The details of this experiment are given in section 5. |
| Researcher Affiliation | Academia | Nikola B. Kovachki EMAIL Computing and Mathematical Sciences California Institute of Technology Pasadena, CA 91125, USA; Andrew M. Stuart EMAIL Computing and Mathematical Sciences California Institute of Technology Pasadena, CA 91125, USA |
| Pseudocode | No | The paper describes various optimization methods and numerical schemes using mathematical equations (e.g., equations (6), (7), (9), (10), (15), (35), (36)) and detailed textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We train a deep autoencoder, using the architecture of Hinton and Salakhutdinov (2006) on the MNIST dataset LeCun and Cortes (2010). |
| Dataset Splits | No | Since our work is concerned only with optimization and not generalization, we present our results only on the training set of 60,000 images and ignore the testing set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions "Widely used deep learning libraries such as TensorFlow Abadi et al. (2015) and PyTorch Paszke et al. (2017)" in a general context but does not specify the versions of any software dependencies used for their own experiments. |
| Experiment Setup | Yes | We fix an initialization of the autoencoder following Glorot and Bengio (2010) and use it to test every optimization method. Furthermore, we fix a batch size of 200 and train for 500 epochs, not shuffling the data set during training so that each method sees the same realization of the noise. We use the mean-squared error as our loss function. We were unable to train the autoencoder using (35) with h = 1 since λ = 0.9 implies an effective learning rate of 10 for which the system blows up. Since deep neural networks are not strongly convex, there is no single optimal choice of µ; we simply set µ = 1 in our experiments. |
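The blow-up reported in the experiment-setup row can be checked arithmetically: with step size h and momentum factor λ, the commonly used effective learning rate is h / (1 − λ), which for h = 1 and λ = 0.9 gives 10. The sketch below illustrates this relation; the function name and the validity range on λ are our own assumptions for illustration, not notation from the paper.

```python
def effective_learning_rate(h: float, lam: float) -> float:
    """Effective learning rate of a momentum method with step size h
    and momentum factor lam, in the common h / (1 - lam) sense."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("momentum factor must lie in [0, 1)")
    return h / (1.0 - lam)

# The paper's quoted values: h = 1, lam = 0.9 give an effective rate of 10,
# which is the regime in which the authors report the system blows up.
print(effective_learning_rate(1.0, 0.9))  # prints 10.0
```

This is only a back-of-the-envelope check that the quoted numbers are mutually consistent, not a reimplementation of the paper's scheme (35).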