Understanding Optimization in Deep Learning with Central Flows
Authors: Jeremy Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, Jason Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. ... Although this derivation employs informal mathematical reasoning, our experiments demonstrate that this central flow can successfully predict long-term optimization trajectories on a variety of neural networks with a high degree of numerical accuracy. |
| Researcher Affiliation | Academia | 1 Carnegie Mellon University, 2 Princeton University, 3 Flatiron Institute |
| Pseudocode | No | The paper includes mathematical equations for optimizers (e.g., w_{t+1} = w_t − η∇L(w_t) (GD)) and differential equations, but no explicitly labeled pseudocode blocks or algorithms with structured steps. |
| Open Source Code | Yes | Our code can be found at: http://github.com/locuslab/central_flows. |
| Open Datasets | Yes | We test the vision architectures (CNN, ResNet, ViT) on a subset of CIFAR-10 (Krizhevsky, 2009). We test the sequence architectures (LSTM, Transformer, Mamba) on a synthetic sorting task (Karpathy, 2020). |
| Dataset Splits | No | We test the vision architectures on a subset of CIFAR-10 that contains 1000 training examples, all from the first 4 CIFAR-10 classes. ... The size of the training dataset was usually 1,000 (except for Mamba, where it was 250). The paper states the size of the training dataset but does not specify validation or test splits, nor does it refer to a standard split in sufficient detail. |
| Hardware Specification | No | The paper discusses the computational expense and time complexity of its methods but does not provide specific details about the hardware used, such as GPU models, CPU types, or cloud computing instance specifications. |
| Software Dependencies | No | The paper mentions software such as JAX, PyTorch, and Python+NumPy programs. However, it does not provide specific version numbers for any of these software components, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | A ViT is trained on CIFAR-10 using gradient descent with η = 2/200 (blue). ... Scalar RMSProp with η = 2/400 and β2 = 0.99 (blue). ... RMSProp with η = 2 * 10^-5 and β2 = 0.99 (blue). ... The quality of the central flow approximation is enhanced by starting it 10-15 steps into training. ... We initialize the weights and biases of the final linear layer to be zero, as this makes the curvature low at initialization. |
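For concreteness, the two update rules quoted in the table can be sketched as follows. This is a minimal illustration on a toy quadratic loss, not the paper's implementation; the "scalar RMSProp" form (a single shared second-moment scalar ν rather than a per-coordinate vector) is an assumption based on the standard formulation, and the function names and hyperparameter values here are illustrative only.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad L(w) = w.

def gradient_descent(w, lr, steps):
    # (GD)  w_{t+1} = w_t - eta * grad L(w_t)
    for _ in range(steps):
        w = w - lr * w  # gradient of 0.5*||w||^2 is w
    return w

def scalar_rmsprop(w, lr, beta2, steps, eps=1e-8):
    # nu_{t+1} = beta2 * nu_t + (1 - beta2) * ||grad L(w_t)||^2
    # w_{t+1}  = w_t - lr / sqrt(nu_{t+1}) * grad L(w_t)
    nu = 0.0
    for _ in range(steps):
        g = w
        nu = beta2 * nu + (1 - beta2) * float(g @ g)
        w = w - lr / (np.sqrt(nu) + eps) * g
    return w

w0 = np.array([1.0, -2.0])
w_gd = gradient_descent(w0, lr=0.1, steps=100)
w_rms = scalar_rmsprop(w0, lr=0.01, beta2=0.99, steps=200)
```

On this convex toy problem both iterates approach the minimum at the origin; the discrete oscillatory behavior that the central flow is designed to average over only emerges at larger effective step sizes on real networks.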