Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes
Authors: Zhenfeng Tu, Santiago Tomas Aranguri Diaz, Arthur Jacot
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both lazy and balanced regimes but also a mixed regime in between the two. In this paper, we study this transition in the context of linear networks and focus mainly on the effects of the width w and the variance of the weights at initialization σ², and give a precise and almost complete phase diagram, showing the transitions between lazy and active regimes. Figure 1: For both plots, we train either using gradient descent or the self-consistent dynamics from equation (1), with the scaling γ_σ² = 1.85, γ_w = 2.25, which lies in the active regime. (Left panel): We plot train and test error for both dynamics. Figure 2: As a function of γ_σ², γ_w, we run GD and plot different quantities. |
| Researcher Affiliation | Academia | Zhenfeng Tu, Courant Institute, New York University, New York, NY 10012, EMAIL; Santiago Aranguri, Courant Institute, New York University, New York, NY 10012, EMAIL; Arthur Jacot, Courant Institute, New York University, New York, NY 10012, EMAIL |
| Pseudocode | No | The paper provides mathematical formulas and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We use synthetic data, with a description of how to build this synthetic data. The experiments are only there for visualization purposes; we see no particular need to publish it. |
| Open Datasets | No | For all the experiments, we used the losses L_train(θ) = (1/d²) ‖A_θ − (A + E)‖²_F and L_test(θ) = (1/d²) ‖A_θ − A‖²_F, where E has i.i.d. N(0, 1) entries and A = K^{−1/2} Σ_{i=1}^{K} u_i v_iᵀ with u_i, v_i ~ N(0, Id_d) Gaussian vectors in R^d. This means that Rank A = K. The factor K^{−1/2} ensures that ‖A‖_F = Θ(d). |
| Dataset Splits | No | The paper mentions 'train and test error' and 'train error converged' but does not specify validation splits or proportions (e.g., 80/10/10 split or specific sample counts for validation). |
| Hardware Specification | Yes | Experiments took 12 hours of compute, using two GeForce RTX 2080 Ti (11 GB memory) and two TITAN V (12 GB memory) GPUs. |
| Software Dependencies | No | All the experiments were implemented in PyTorch [40]. |
| Experiment Setup | Yes | For the experiments in Figure 1, we took d = 500 and K = 5. For the experiments in Figure 2, we took d = 200 and K = 5. For making the contour plot, we took a grid with 35 points for γ_σ² ∈ [−3.0, 0.0] and 35 points for γ_w ∈ [0, 2.8]. For each of the 35² pairs of values for (γ_σ², γ_w), we ran gradient descent (and, for the lower-right plot, the self-consistent dynamics too) until the train error converged. Following Theorem 2, we take a learning rate η = d²/(c·w·σ²) for γ_σ² + γ_w > 1, and η = d²/(c·‖A‖_op) otherwise, where c is usually 50 but can be taken to be 2 or 5 for faster convergence at the cost of more unstable training. |