Nesterov acceleration in benignly non-convex landscapes
Authors: Kanan Gupta, Stephan Wojtowytsch
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate in Figure 3 that our assumptions are locally reasonable in deep learning. We trained a fully connected neural network (with 10 layers, width 35, tanh activation) to fit labels yᵢ at 100 randomly generated datapoints xᵢ ∈ ℝ¹². The small dataset size allowed us to use the exact gradient and loss function instead of stochastic approximations, for a better exploration of the loss landscape. Since the closest minimizer is generally unknown, we use the gradient as a proxy and examine the convexity of ϕ(t) = L(w + tg) for w very close to the set of global minimizers of the loss function L as in (2) and g = ∇L(w)/‖∇L(w)‖. Labels were generated using a randomly initialized teacher network (with 7 layers and width 20). Student networks were trained for 10,000 epochs using stochastic gradient descent with Nesterov momentum, with learning rate η = 0.005 and momentum ρ = 0.99. Final training loss ranged between 10⁻¹² and 10⁻⁹ across the five runs. Second derivatives were approximated using second-order difference quotients ϕ″(t) ≈ (ϕ(t+h) − 2ϕ(t) + ϕ(t−h))/h² for h = 0.01. Similarly, the strong aiming parameter with respect to the global minimizer was estimated by 2(ϕ′(t)·t − ϕ(t) + inf ϕ)/t², where ϕ′(t) was estimated by the central difference (ϕ(t+h) − ϕ(t−h))/(2h). |
| Researcher Affiliation | Academia | Kanan Gupta, Stephan Wojtowytsch, Department of Mathematics, University of Pittsburgh, EMAIL, EMAIL |
| Pseudocode | No | The paper describes algorithms such as the time-stepping scheme (5) and the AGNES scheme (10) using mathematical equations, but it does not present them in a structured 'Pseudocode' or 'Algorithm' block format. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper states, "We trained a fully connected neural network... to fit labels yᵢ at 100 randomly generated datapoints xᵢ ∈ ℝ¹². ... Labels were generated using a randomly initialized teacher network." This indicates the use of synthetically generated data without providing specific access information for a publicly available dataset. |
| Dataset Splits | No | The paper mentions using "100 randomly generated datapoints" but does not provide any specific information regarding training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions training a neural network. |
| Software Dependencies | No | The paper mentions using "stochastic gradient descent with Nesterov momentum" and "tanh activation" but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow) for the implementation. |
| Experiment Setup | Yes | Student networks were trained for 10,000 epochs using stochastic gradient descent with Nesterov momentum, with learning rate η = 0.005 and momentum ρ = 0.99. ... Second derivatives were approximated using second-order difference quotients ϕ″(t) ≈ (ϕ(t+h) − 2ϕ(t) + ϕ(t−h))/h² for h = 0.01. Similarly, the strong aiming parameter with respect to the global minimizer was estimated by 2(ϕ′(t)·t − ϕ(t) + inf ϕ)/t², where ϕ′(t) was estimated by the central difference (ϕ(t+h) − ϕ(t−h))/(2h). |
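The convexity probe described in the table above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it substitutes a toy quadratic loss for the neural-network loss L, and the names (`loss`, `grad`, `w`, `mu_hat`) are assumptions for illustration. The finite-difference formulas match those quoted from the paper.

```python
import math
import random

# Toy stand-in for the paper's neural-network loss L (an assumption for
# illustration; the paper uses a 10-layer tanh network's training loss).
def loss(w):
    return 0.5 * sum(x * x for x in w)

def grad(w):
    # Gradient of the toy quadratic loss above.
    return list(w)

rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(12)]      # a point in R^12, as in the paper

# Unit gradient direction g = grad L(w) / ||grad L(w)||, used as a proxy
# for the direction toward the (unknown) closest minimizer.
gnorm = math.sqrt(sum(x * x for x in grad(w)))
g = [x / gnorm for x in grad(w)]

def phi(t):
    # One-dimensional slice phi(t) = L(w + t g) whose convexity is examined.
    return loss([wi + t * gi for wi, gi in zip(w, g)])

h, t = 0.01, 0.5
# Second-order difference quotient: phi''(t) ~ (phi(t+h) - 2 phi(t) + phi(t-h)) / h^2
phi_dd = (phi(t + h) - 2 * phi(t) + phi(t - h)) / h ** 2
# Central difference for phi'(t), feeding the strong-aiming estimate
phi_d = (phi(t + h) - phi(t - h)) / (2 * h)
# Strong-aiming estimate: 2 (phi'(t) t - phi(t) + inf phi) / t^2.
# For the toy loss the infimum is known to be 0; in the paper it is the
# (near-zero) final training loss.
inf_phi = 0.0
mu_hat = 2 * (phi_d * t - phi(t) + inf_phi) / t ** 2

print(phi_dd > 0)   # local convexity of the slice at t
```

For the quadratic stand-in the difference quotients are exact up to rounding (phi″ ≡ 1 along any unit direction), which makes the sketch easy to sanity-check before swapping in a real network loss.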