Early Alignment in Two-Layer Networks Training is a Two-Edged Sword
Authors: Etienne Boursier, Nicolas Flammarion
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section empirically illustrates the results of Theorems 1 and 2. The considered dataset does not exactly fit the conditions of Theorem 2 to illustrate that Assumption 3 with η < 1/6 is only needed for analytical purposes. The dataset is however similar to datasets satisfying Assumption 3 (see e.g., Figure 1) in the sense that all three data points are positively correlated, with positive labels; and the middle point is below the optimal linear regressor. [...] Figure 2 illustrates the training dynamics over time. |
| Researcher Affiliation | Academia | Etienne Boursier EMAIL Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France; Nicolas Flammarion EMAIL TML Lab, EPFL, Switzerland |
| Pseudocode | No | The paper describes methods and theoretical analysis through mathematical equations and proofs. It mentions a 'proof sketch' but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | The code and animated versions of the figures are also available at github.com/eboursier/early_alignment. |
| Open Datasets | No | We consider the following 3 points data example (n = 3 in this section). Assumption 3. The data is given by 3 points (xk, yk) ∈ ℝ³, for some η > 0: x1 ∈ (−1, −1 + η] × [1, 1 + η] and y1 ∈ [1, 1 + η]; x2 ∈ [−η, η] × [1 − η, 1 + η] and y2 ∈ (0, η]; x3 ∈ [1 − η, 1) × [1, 1 + η] and y3 ∈ [1, 1 + η]. [...] In Section 5, we considered the following univariate 3 points dataset: x1 = 0.75 and y1 = 1.1; x2 = 0.5 and y2 = 0.1; x3 = 0.125 and y3 = 0.8. [...] Precisely, we consider 40 univariate data points xi sampled uniformly at random in [−1, 1]. |
| Dataset Splits | No | The paper uses custom-generated simple data examples to illustrate theoretical results and training dynamics. It does not perform evaluations requiring standard training, validation, or test splits, and therefore, no such split information is provided. |
| Hardware Specification | No | The paper does not provide specific hardware details (like exact GPU models, CPU types, or memory configurations) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'ReLU network with gradient descent' and 'trained with m = 200 000 neurons' using a 'learning rate 10^-3', but it does not specify any particular software libraries, frameworks, or their version numbers (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | The activation function is ReLU, the initialisation follows Equation (3) with λ = 10^-3 and wj ∼ N(0, I2), aj = sj‖wj‖ with sj ∼ U({−1, 1}). [...] Lastly, the neural network is trained with m = 200 000 neurons to approximate the infinite width regime. We ran gradient descent with learning rate 10^-3 for up to 2 million iterations. |
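The reported setup can be sketched end to end; this is a minimal reproduction sketch, not the authors' code. Assumptions beyond the quoted excerpts: a bias coordinate appended to the univariate inputs (mirroring the near-1 second coordinate in Assumption 3), the balanced form aj = sj‖wj‖ of initialisation (3), no output rescaling, and a width and iteration budget shrunk from m = 200 000 and 2 million iterations for speed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Section 5 dataset; the appended bias coordinate (second column) is an assumption.
X = np.array([[0.75, 1.0], [0.5, 1.0], [0.125, 1.0]])
y = np.array([1.1, 0.1, 0.8])

# Width and iteration count reduced from the reported m = 200 000 / 2e6 iterations.
m, lam, lr, steps = 200, 1e-3, 1e-3, 20_000

W = lam * rng.standard_normal((m, 2))  # w_j ~ N(0, lam^2 I_2)
s = rng.choice([-1.0, 1.0], size=m)    # s_j ~ U({-1, +1})
a = s * np.linalg.norm(W, axis=1)      # balanced init |a_j| = ||w_j|| (assumed form)

def loss(W, a):
    """Squared loss of the two-layer ReLU net f(x) = sum_j a_j relu(<w_j, x>)."""
    pred = np.maximum(X @ W.T, 0.0) @ a
    return 0.5 * np.mean((pred - y) ** 2)

losses = [loss(W, a)]
for _ in range(steps):
    pre = X @ W.T                                    # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    r = (act @ a - y) / len(y)                       # scaled residuals
    grad_a = act.T @ r                               # dL/da_j
    grad_W = ((r[:, None] * (pre > 0.0)) * a).T @ X  # dL/dw_j
    W -= lr * grad_W
    a -= lr * grad_a
    losses.append(loss(W, a))
```

Plotting `losses` should show the phenomenon the paper studies: with a small initialisation scale λ, the loss stays close to its initial value while neurons align in direction, before the norms grow and the loss drops.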