On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
Authors: Thomas T.C.K. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. [...] Lastly, we carefully numerically verify our theoretical predictions. Notably, we confirm the findings in Benzing (2022) that full second-order methods heavily underperform KFAC in convergence rate and stability. We also show standard tools like Adam-like preconditioning and batch-norm (Ioffe & Szegedy, 2015) do not fix the issues we identify, even for our simple models, and may even hurt generalization in the latter's case. |
| Researcher Affiliation | Academia | 1University of Pennsylvania. Correspondence to: T. Zhang, B. Moniri <EMAIL>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. While algorithmic steps are described mathematically (e.g., equations 5 and 8), they are not presented in a structured pseudocode format. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper describes data generation processes for its experiments, for example: "Our data generation process for the training task and the transfer task are as follows: y_i^s = F^s G x_i^s + ε_i^s, x_i^s ~ i.i.d. Σ_{x,s}^{1/2} Unif({±1}^{d_X}), ε_i^s ~ i.i.d. N(0, σ_{ε,s}^2 I_{d_Y}), s ∈ {test, train}". It uses synthetic data generated according to specified distributions rather than pre-existing public datasets, and no access to the generated data is provided. |
| Dataset Splits | No | The paper describes generating data for training and transfer tasks separately, but does not provide explicit training/validation/test splits (e.g., percentages or counts) of a single static dataset. The data is generated on-the-fly for specific tasks. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., CPU, GPU models, or cloud computing instances). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the experiments. |
| Experiment Setup | Yes | Our data generation process for the training task and the transfer task are as follows: ... We use d_X = 100, d_Y = 15, k = 8, and batch size n = 1024. ... We use the same learning rate 10^{-2} for each optimizer except for NGD, for which we used 10^{-4}. The batch size is 1024. ... In this experiment, we set d_X = 200, n = 6000, and d_h = 1000 and set λ_G 0. ... We set d_X = 900, n = 5000, d_h = 1000, and Σ_x = Σ_x^{(0.5)}. |
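The quoted data-generation process (y_i^s = F^s G x_i^s + ε_i^s with Rademacher-style inputs shaped by a covariance) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the matrices F and G, the noise scale sigma_eps, and the choice of an identity covariance Σ_x are all assumptions made here for concreteness; only the dimensions d_X = 100, d_Y = 15, k = 8, and n = 1024 come from the quoted setup.

```python
import numpy as np

# Sketch of the paper's synthetic regression data:
#   y_i = F G x_i + eps_i,
#   x_i ~ Sigma_x^{1/2} Unif({±1}^{d_X}),  eps_i ~ N(0, sigma_eps^2 I_{d_Y}).
rng = np.random.default_rng(0)

d_X, d_Y, k, n = 100, 15, 8, 1024  # dimensions and batch size quoted in the paper
sigma_eps = 0.1                    # noise scale: assumed, not stated in the quote

G = rng.standard_normal((k, d_X)) / np.sqrt(d_X)  # shared low-rank feature map (illustrative)
F = rng.standard_normal((d_Y, k)) / np.sqrt(k)    # task-specific head (illustrative)
Sigma_x_sqrt = np.eye(d_X)                        # isotropic covariance: an assumption

signs = rng.choice([-1.0, 1.0], size=(d_X, n))    # Unif({±1}^{d_X}) entries
x = Sigma_x_sqrt @ signs                          # apply the covariance square root
eps = sigma_eps * rng.standard_normal((d_Y, n))   # i.i.d. Gaussian label noise
y = F @ G @ x + eps                               # low-rank linear targets plus noise

print(x.shape, y.shape)  # (100, 1024) (15, 1024)
```

Swapping `Sigma_x_sqrt` for a non-identity matrix square root reproduces the anisotropic-input settings the paper varies (e.g. its Σ_x^{(0.5)} covariance).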