Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks
Authors: Etienne Boursier, Nicolas Flammarion
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work illustrates, on a simple linear example, the phenomenon of non-convergence of the parameters towards a global minimum of the training loss, despite overparametrization. This non-convergence actually yields a simplicity bias on the final estimator, which can lead to an optimal fit of the true data distribution. A similar phenomenon has been observed in more complex and realistic settings (Yoon et al., 2023; Kadkhodaie et al., 2024; Raventós et al., 2024); however, a theoretical analysis remains out of reach in these cases. It is still unclear whether the observed non-convergence arises from the early alignment mechanism proposed in our work, from stability issues as suggested by Qiao et al. (2024), from other factors, or from a combination of these effects. Our result is proven via the description of the early alignment phase. Besides the specific data example considered in Section 4, we also provide concentration bounds on the extremal vectors driving this early alignment. We believe these bounds (Theorem 3.1) can be used in subsequent works to better understand this early phase of the training dynamics, and how it yields biases towards simple estimators. |
| Researcher Affiliation | Academia | INRIA, LMO, Université Paris-Saclay, Orsay, France; TML Lab, EPFL, Switzerland. Correspondence to: Etienne Boursier <EMAIL>. |
| Pseudocode | No | The paper includes mathematical derivations and descriptions of algorithms (like gradient flow) but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | All the experiments were run on a personal MacBook Pro, for a total compute time of approximately 100 hours. The code can be found at github.com/eboursier/simplicity_bias. |
| Open Datasets | No | The paper uses synthetic data generated according to a linear model: "yk = β xk + ηk, where ηk are drawn i.i.d. as centered Gaussian of variance σ2 = 0.09, xk are drawn i.i.d. as centered Gaussian variables and β is fixed, without loss of generality, to β = (1, 0, . . . , 0)." This data is generated by the authors, and no public access or repository is provided for specific instances of the generated datasets. |
| Dataset Splits | No | The paper describes generating training samples and evaluating train and test losses, but does not explicitly define fixed training, validation, and test splits from a pre-existing dataset. It varies the 'number of training samples' (n) but generates new data for each run. |
| Hardware Specification | Yes | All the experiments were run on a personal MacBook Pro, for a total compute time of approximately 100 hours. |
| Software Dependencies | No | The paper mentions "pytorch default hyperparameters" in Appendix A.3 but does not specify a version number for PyTorch or any other software library. |
| Experiment Setup | Yes | The neural networks are trained via stochastic gradient descent (SGD), with batch size 32 and learning rate 0.01. To ensure that we reached convergence of the parameters, we train the networks for 8 × 10^6 iterations of SGD, where the training seems stabilized. |
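The synthetic data generation and training protocol described in the table can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code: the input dimension `d`, sample count `n`, hidden width `m`, and number of epochs are illustrative assumptions, and the run is truncated far short of the paper's 8 × 10^6 SGD iterations.

```python
import torch

torch.manual_seed(0)
d, n, m = 10, 200, 100                # input dim, samples, hidden width (illustrative)

# Linear teacher model from the paper: y_k = <beta, x_k> + eta_k,
# with beta = (1, 0, ..., 0), x_k centered Gaussian, eta_k ~ N(0, sigma^2), sigma^2 = 0.09.
beta = torch.zeros(d)
beta[0] = 1.0
X = torch.randn(n, d)
y = X @ beta + 0.3 * torch.randn(n)   # noise std 0.3, i.e. variance 0.09

# Two-layer ReLU network trained with SGD, batch size 32, learning rate 0.01,
# as stated in the experiment setup row above.
model = torch.nn.Sequential(
    torch.nn.Linear(d, m),
    torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y.unsqueeze(1)),
    batch_size=32,
    shuffle=True,
)

for epoch in range(5):                # the paper trains for ~8e6 iterations
    for xb, yb in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()
```

Since each run draws fresh Gaussian samples, varying `n` here reproduces the paper's practice of generating new data per run rather than splitting a fixed dataset.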