Gradient Descent Learns Linear Dynamical Systems
Authors: Moritz Hardt, Tengyu Ma, Benjamin Recht
JMLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide proof-of-concept experiments on synthetic data. We demonstrate that: 1) plain SGD tends to blow up even with a relatively small learning rate, especially on hard instances; 2) SGD with our projection step converges with a reasonably large learning rate, and with over-parameterization the final error is competitive; 3) SGD with gradient clipping has the strongest performance in terms of both convergence speed and final error. Our experiments suggest that the landscape of the objective function may be even nicer than what is predicted by our theoretical development. |
| Researcher Affiliation | Collaboration | Moritz Hardt (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley; Tengyu Ma (EMAIL), Facebook AI Research; Benjamin Recht (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: Projected stochastic gradient descent with partial loss; Algorithm 2: Projected stochastic gradient descent for long sequences; Algorithm 3: Back-propagation |
| Open Source Code | No | The paper does not contain any explicit statements about code availability, links to repositories, or mentions of code in supplementary materials. |
| Open Datasets | No | We generate the true system with state dimension d = 20 by randomly picking the conjugate pairs of roots of the characteristic polynomial inside the circle with radius ρ = 0.95 and randomly generating the vector C from a standard normal distribution. The inputs of the dynamical model are generated from a standard normal distribution with length T = 500. |
| Dataset Splits | No | The inputs of the dynamical model are generated from a standard normal distribution with length T = 500. We note that we generate fresh new inputs and outputs at every iteration, and therefore the training loss is equal to the test loss (in expectation). |
| Hardware Specification | No | The paper describes the experimental setup in Section 8 "Simulations" but does not specify any hardware details like GPU/CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers). |
| Experiment Setup | Yes | We use an initial learning rate of 0.01 for projected gradient descent and SGD with gradient clipping. We use batch size 100 for all experiments, and decay the learning rate by a factor of 10 at the 200K and 250K iterations in all experiments. |
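
Since no code is released, the synthetic-system generation described in the Open Datasets row can only be approximated. Below is a minimal NumPy sketch under stated assumptions: roots are sampled uniformly (by area) inside the disk of radius ρ = 0.95 in conjugate pairs, the system is realized in controllable canonical (companion) form from those characteristic-polynomial roots, and the rollout `x_{t+1} = A x_t + B u_t`, `y_t = C x_t` is one plausible reading of the paper's model; the sampling distribution over the disk and the exact input/output map are our assumptions, not stated in the table.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, rho = 20, 500, 0.95  # state dimension, sequence length, spectral radius bound

# Sample d/2 conjugate root pairs uniformly (by area) inside the radius-rho disk.
# (Assumption: the paper does not specify the sampling distribution.)
r = rho * np.sqrt(rng.uniform(size=d // 2))
theta = rng.uniform(0.0, np.pi, size=d // 2)
roots = r * np.exp(1j * theta)
roots = np.concatenate([roots, roots.conj()])

# Characteristic-polynomial coefficients -> companion (controllable canonical) matrix.
a = np.real(np.poly(roots))[1:]  # monic polynomial; drop the leading 1
A = np.zeros((d, d))
A[0, :] = -a                     # top row holds the negated coefficients
A[1:, :-1] = np.eye(d - 1)       # sub-diagonal identity
B = np.zeros((d, 1))
B[0, 0] = 1.0

C = rng.standard_normal((1, d))  # output vector from a standard normal, as stated

# Roll out the system on standard-normal inputs of length T = 500.
u = rng.standard_normal(T)
x = np.zeros(d)
y = np.empty(T)
for t in range(T):
    x = A @ x + B[:, 0] * u[t]
    y[t] = (C @ x)[0]
```

By construction the spectral radius of `A` equals the largest sampled root modulus, so it stays strictly below 0.95, matching the stability constraint in the description.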
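
The Experiment Setup row pins down the optimizer hyperparameters (initial learning rate 0.01, batch size 100, decay by 10x at 200K and 250K iterations) but not the clipping threshold. A hedged sketch of the step-decay schedule and a clipped-SGD update, with `clip_norm` as a hypothetical placeholder value not given in the paper:

```python
import numpy as np

def lr_schedule(it, base_lr=0.01, decay_at=(200_000, 250_000), factor=10.0):
    """Step decay: divide the learning rate by `factor` at each milestone iteration."""
    lr = base_lr
    for milestone in decay_at:
        if it >= milestone:
            lr /= factor
    return lr

def clipped_sgd_step(params, grad, it, clip_norm=1.0):
    """One SGD update with global-norm gradient clipping.

    `clip_norm=1.0` is an assumed placeholder; the paper's table does not
    report the clipping threshold.
    """
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)  # rescale so the gradient norm equals clip_norm
    return params - lr_schedule(it) * grad
```

With this schedule the learning rate is 0.01 up to iteration 200K, 0.001 until 250K, and 0.0001 afterwards, matching the decay points stated in the setup.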