Gradient Descent Learns Linear Dynamical Systems

Authors: Moritz Hardt, Tengyu Ma, Benjamin Recht

JMLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide proof-of-concept experiments on synthetic data. We demonstrate that (1) plain SGD tends to blow up even with a relatively small learning rate, especially on hard instances; (2) SGD with our projection step converges with a reasonably large learning rate, and with over-parameterization the final error is competitive; (3) SGD with gradient clipping has the strongest performance in terms of both convergence speed and final error. Our experiments suggest that the landscape of the objective function may be even nicer than what is predicted by our theoretical development.
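The quoted findings compare plain SGD against variants with projection and with gradient clipping. A minimal NumPy sketch of the gradient-clipping update is below; the function names, the clipping threshold `max_norm`, and the default learning rate are illustrative assumptions, not the paper's Algorithm 1–3 (the 0.01 learning rate matches the experiment setup row):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its Euclidean norm is at most max_norm.

    max_norm is an assumed hyperparameter; the paper does not state its value.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

def sgd_step(theta, grad, lr=0.01, max_norm=1.0):
    """One SGD step with gradient clipping; lr=0.01 matches the paper's setup."""
    return theta - lr * clip_gradient(grad, max_norm)
```

Clipping bounds the size of every parameter update by `lr * max_norm`, which is one plausible reason the quoted evidence reports that it avoids the blow-ups seen with plain SGD.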
Researcher Affiliation | Collaboration | Moritz Hardt (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley; Tengyu Ma (EMAIL), Facebook AI Research; Benjamin Recht (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Projected stochastic gradient descent with partial loss; Algorithm 2: Projected stochastic gradient descent for long sequences; Algorithm 3: Back-propagation
Open Source Code | No | The paper does not contain any explicit statements about code availability, links to repositories, or mentions of code in supplementary materials.
Open Datasets | No | We generate the true system with state dimension d = 20 by randomly picking the conjugate pairs of roots of the characteristic polynomial inside the circle with radius ρ = 0.95 and randomly generating the vector C from the standard normal distribution. The inputs of the dynamical model are generated from the standard normal distribution with length T = 500.
Dataset Splits | No | The inputs of the dynamical model are generated from the standard normal distribution with length T = 500. We note that we generate fresh inputs and outputs at every iteration, and therefore the training loss is equal to the test loss (in expectation).
Hardware Specification | No | The paper describes the experimental setup in Section 8 "Simulations" but does not specify any hardware details like GPU/CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup | Yes | We use an initial learning rate of 0.01 for projected gradient descent and SGD with gradient clipping. We use batch size 100 for all experiments, and decay the learning rate by a factor of 10 at the 200K and 250K iterations in all experiments.
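The quoted schedule (initial rate 0.01, divided by 10 at 200K and again at 250K iterations) can be sketched as a small step-decay function; the function name is a hypothetical helper, not something from the paper:

```python
def learning_rate(iteration, base_lr=0.01):
    """Step schedule from the quoted setup: decay by 10x at 200K and 250K iterations."""
    lr = base_lr
    if iteration >= 200_000:
        lr /= 10
    if iteration >= 250_000:
        lr /= 10
    return lr
```

So the rate is 0.01 for the first 200K iterations, 0.001 until 250K, and 0.0001 thereafter.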