Attention layers provably solve single-location regression

Authors: Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures. Experimentally, we observe in Figure 4a that PGD is able to recover the oracle parameters (k⋆, v⋆). The results of Theorem 5 are illustrated by Figure 4b. We observe that, due to roundoff errors, the dynamics are not exactly on the manifold but stay very close to the manifold. Our experiments (Figures 4a, 5, and Appendix E) yield encouraging results in all these directions."
Researcher Affiliation | Academia | Pierre Marion (Institute of Mathematics, EPFL, Lausanne, Switzerland); Raphaël Berthier (Sorbonne Université, Inria, Centre Inria de Sorbonne Université, Paris, France); Gérard Biau (Sorbonne Université, Institut Universitaire de France, Paris, France); Claire Boyer (Université Paris-Saclay, Institut Universitaire de France, Orsay, France)
Pseudocode | No | The paper defines Projected (Riemannian) Gradient Descent (PGD) in Definition 1, which gives a recursive formula for updating the parameters. While this describes a procedure via mathematical equations, it is not presented in a structured pseudocode or algorithm block.
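For illustration, the generic shape of a projected Riemannian gradient step on the unit sphere can be sketched as follows. This is a minimal NumPy stand-in, not the paper's exact Definition 1 (which couples the attention parameters (k, v)); the function names are ours:

```python
import numpy as np

def project_sphere(u):
    # Retraction onto the unit sphere S^{d-1} by normalisation.
    return u / np.linalg.norm(u)

def pgd_step(theta, grad, lr):
    # Riemannian gradient on the sphere: remove the component of the
    # Euclidean gradient along theta (projection onto the tangent
    # space), take a gradient step, then retract back onto the sphere.
    riem_grad = grad - (grad @ theta) * theta
    return project_sphere(theta - lr * riem_grad)

theta = project_sphere(np.array([3.0, 4.0]))
theta_next = pgd_step(theta, np.array([1.0, -2.0]), lr=0.1)
```

The tangent-space projection followed by renormalisation is what keeps the iterates (up to roundoff) on the manifold, matching the paper's observation that the dynamics "stay very close to the manifold".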
Open Source Code | Yes | "Our code is available at https://github.com/PierreMarion23/single-location-regression"
Open Datasets | No | "Data generation. We use synthetically-generated data for this experiment. To create our train set, we generate sentences according to the patterns... The size of the datasets is given in the table below."
Dataset Splits | Yes | "The size of the datasets is given in the table below."

Name | Number of examples
Train set | 15552
Test set | 4608
Test w. OOD tokens | 3072
Test w. OOD structure | 144
Test w. OOD structure+tokens | 96
Hardware Specification | No | "All experiments run in a short time (less than one hour) on a standard laptop."
Software Dependencies | No | "We use the Transformers (Wolf et al., 2020) and scikit-learn (Pedregosa et al., 2011) libraries for the experiment of Section 2, and JAX (Bradbury et al., 2018) for the experiment of Section 5."
Experiment Setup | Yes | "We train using single-pass stochastic gradient descent (meaning that fresh samples are used at each step), for 8,000 steps with a batch size of 128 and a learning rate of 0.01. The experiment is repeated 20 times with independent random initializations, and 95% percentile intervals are plotted (but are not visible when the variance is too small). Parameters K, V, W1, W2 are initialized with Gaussian entries of variance 2/(din + dout). The bias terms are initialized to 0, as well as the query matrix Q... The output weights θ are initialized with Gaussian entries of variance 1/d²... Parameters are L = 10, d = p = 80, m = 200, ε² = 0.01, γ² = 0.5. The following table summarizes the value of the parameters in our experiments."

Name | Figure 4a | Figure 4b | Figure 5 | Figure 7a | Figure 7b
d | 400 | 400 | 80 | 400 | 400
L | 10 | 10 | 10 | 10 | 10
γ | 1/2 (all figures)
λt | 1/(1 + 10^4 t) | 0.1 | 2/(1 + 10^4 t) | 0.9 | 0.1
α | 4·10^-3 | 4·10^-3 | 10^-3 | 10^-3 | 4·10^-3
Number of steps | 120k | 20k | 200k | 120k | 20k
N. of repetitions | 30 | 30 | 30 | 30 | 30
Batch size | - | - | 5 | - | -
ε | - | - | 0.1 | - | -
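The quoted initialization scheme can be sketched as follows. This is an illustrative NumPy stand-in (the paper's own code uses JAX); the variance 2/(d_in + d_out) for K, V, W1, W2, the zero-initialized Q and biases, and the variance 1/d² for θ are taken from the quote, while the helper name and the exact matrix shapes are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(d_in, d_out):
    # Gaussian entries of variance 2/(d_in + d_out), as stated in the setup.
    std = np.sqrt(2.0 / (d_in + d_out))
    return rng.normal(0.0, std, size=(d_in, d_out))

d, m = 80, 200          # d = p = 80, MLP width m = 200 (Section 2 setup)

K = glorot_normal(d, d)   # key matrix
V = glorot_normal(d, d)   # value matrix
W1 = glorot_normal(d, m)  # first MLP layer
W2 = glorot_normal(m, d)  # second MLP layer
Q = np.zeros((d, d))      # query matrix initialized to 0, like the biases

# Output weights theta: Gaussian entries of variance 1/d^2, i.e. std 1/d.
theta = rng.normal(0.0, 1.0 / d, size=d)

# Single-pass SGD hyperparameters from the quote.
batch_size, learning_rate, n_steps = 128, 0.01, 8000
```

Single-pass SGD then draws a fresh batch of 128 synthetic samples at each of the 8,000 steps, so no training example is ever revisited.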