Early Alignment in Two-Layer Networks Training is a Two-Edged Sword

Authors: Etienne Boursier, Nicolas Flammarion

JMLR 2025

Reproducibility Assessment (Variable: Result — LLM Response)
Research Type: Experimental — This section empirically illustrates the results of Theorems 1 and 2. The considered dataset does not exactly fit the conditions of Theorem 2, to illustrate that Assumption 3 with η < 1/6 is only needed for analytical purposes. The dataset is however similar to datasets satisfying Assumption 3 (see e.g., Figure 1) in the sense that all three data points are positively correlated, with positive labels, and the middle point is below the optimal linear regressor. [...] Figure 2 illustrates the training dynamics over time.
Researcher Affiliation: Academia — Etienne Boursier (EMAIL), Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France; Nicolas Flammarion (EMAIL), TML Lab, EPFL, Switzerland.
Pseudocode: No — The paper describes its methods and theoretical analysis through mathematical equations and proofs. It mentions a 'proof sketch' but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code: Yes — The code and animated versions of the figures are also available at github.com/eboursier/early_alignment.
Open Datasets: No — We consider the following 3-point data example (n = 3 in this section). Assumption 3. The data is given by 3 points (xk, yk) ∈ R^3, for some η > 0: x1 ∈ (−1, −1 + η] × [1, 1 + η] and y1 ∈ [1, 1 + η]; x2 ∈ [−η, η] × [1 − η, 1 + η] and y2 ∈ (0, η]; x3 ∈ [1 − η, 1) × [1, 1 + η] and y3 ∈ [1, 1 + η]. [...] In Section 5, we considered the following univariate 3-point dataset: x1 = 0.75 and y1 = 1.1; x2 = 0.5 and y2 = 0.1; x3 = 0.125 and y3 = 0.8. [...] Precisely, we consider 40 univariate data points xi sampled uniformly at random in [−1, 1].
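Since no public dataset is used, the quoted examples are easy to regenerate. A minimal NumPy sketch (the random seed is an arbitrary choice of ours; the paper does not state one):

```python
import numpy as np

# The univariate 3-point dataset quoted from Section 5 of the paper.
x3pts = np.array([0.75, 0.5, 0.125])
y3pts = np.array([1.1, 0.1, 0.8])

# The larger example: 40 univariate points x_i sampled uniformly at
# random in [-1, 1] (seed chosen here for reproducibility only).
rng = np.random.default_rng(0)
x40 = rng.uniform(-1.0, 1.0, size=40)
```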
Dataset Splits: No — The paper uses custom-generated simple data examples to illustrate theoretical results and training dynamics. It does not perform evaluations requiring standard training, validation, or test splits, and therefore no such split information is provided.
Hardware Specification: No — The paper does not provide specific hardware details (such as exact GPU models, CPU types, or memory configurations) used for running its experiments.
Software Dependencies: No — The paper mentions a 'ReLU network with gradient descent' 'trained with m = 200 000 neurons' using a 'learning rate 10^-3', but it does not specify any particular software libraries, frameworks, or their version numbers (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup: Yes — The activation function is ReLU, and the initialisation follows Equation (3) with λ = 10^-3 and wj ~ N(0, I2), aj = sj‖wj‖ with sj ~ U({−1, 1}). [...] Lastly, the neural network is trained with m = 200 000 neurons to approximate the infinite-width regime. We ran gradient descent with learning rate 10^-3 for up to 2 million iterations.
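The quoted setup can be sketched end-to-end. The block below is a reduced illustration, not the paper's code: it uses far fewer neurons and iterations than the paper, reads Equation (3) as scaling both layers by λ, appends a bias coordinate to the univariate inputs, and assumes a squared loss averaged over the n points; the seed is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Section 5's univariate 3-point data, with a bias coordinate appended
# so that inputs are 2-dimensional (our assumption for this sketch).
X = np.array([[0.75, 1.0], [0.5, 1.0], [0.125, 1.0]])
y = np.array([1.1, 0.1, 0.8])
n = len(y)

# Initialisation in the spirit of Equation (3): w_j ~ N(0, I_2),
# a_j = s_j * ||w_j|| with s_j ~ U({-1, 1}), both layers scaled by λ.
m = 2_000          # reduced from the paper's 200 000 neurons
lam = 1e-3
w = rng.standard_normal((m, 2))
a = rng.choice([-1.0, 1.0], size=m) * np.linalg.norm(w, axis=1)
w, a = lam * w, lam * a

def predict(X, w, a):
    """Two-layer ReLU network: f(x) = sum_j a_j * relu(<w_j, x>)."""
    return np.maximum(X @ w.T, 0.0) @ a

def loss(X, y, w, a):
    return 0.5 * np.mean((predict(X, w, a) - y) ** 2)

loss_start = loss(X, y, w, a)

# Plain gradient descent with learning rate 1e-3 as in the paper, but
# for far fewer than the paper's 2 million iterations.
lr = 1e-3
for _ in range(2_000):
    pre = X @ w.T                         # (n, m) pre-activations
    err = np.maximum(pre, 0.0) @ a - y    # residuals, shape (n,)
    grad_a = np.maximum(pre, 0.0).T @ err / n
    grad_w = ((err[:, None] * (pre > 0)) * a).T @ X / n
    a -= lr * grad_a
    w -= lr * grad_w

loss_end = loss(X, y, w, a)
```

With the small λ, the early iterations barely change the loss while the neurons rotate — the "early alignment" phase the theorems describe — so the loss decrease over these few steps is modest.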