Attention layers provably solve single-location regression

Authors: Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures. Experimentally, we observe in Figure 4a that PGD is able to recover the oracle parameters (k⋆, v⋆). The results of Theorem 5 are illustrated by Figure 4b. We observe that, due to roundoff errors, the dynamics are not exactly on the manifold but stay very close to the manifold. Our experiments (Figures 4a, 5, and Appendix E) yield encouraging results in all these directions."
Researcher Affiliation | Academia | Pierre Marion (Institute of Mathematics, EPFL, Lausanne, Switzerland); Raphaël Berthier (Sorbonne Université, Inria, Centre Inria de Sorbonne Université, Paris, France); Gérard Biau (Sorbonne Université, Institut Universitaire de France, Paris, France); Claire Boyer (Université Paris-Saclay, Institut Universitaire de France, Orsay, France)
Pseudocode | No | The paper defines Projected (Riemannian) Gradient Descent (PGD) in Definition 1, which gives a recursive formula for updating the parameters. While this describes a procedure via mathematical equations, it is not presented in a structured pseudocode or algorithm block.
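For illustration, the generic shape of a projected Riemannian gradient step on the unit sphere can be sketched as follows. This is a minimal NumPy stand-in, not the paper's exact Definition 1 (which couples the attention parameters (k, v)); the function names are ours:

```python
import numpy as np

def project_sphere(u):
    # Retraction onto the unit sphere S^{d-1} by normalisation.
    return u / np.linalg.norm(u)

def pgd_step(theta, grad, lr):
    # Riemannian gradient on the sphere: remove the component of the
    # Euclidean gradient along theta (projection onto the tangent
    # space), take a gradient step, then retract back onto the sphere.
    riem_grad = grad - (grad @ theta) * theta
    return project_sphere(theta - lr * riem_grad)

theta = project_sphere(np.array([3.0, 4.0]))
theta_next = pgd_step(theta, np.array([1.0, -2.0]), lr=0.1)
```

The tangent-space projection followed by renormalisation is what keeps the iterates (up to roundoff) on the manifold, matching the paper's observation that the dynamics "stay very close to the manifold".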
Open Source Code | Yes | "Our code is available at https://github.com/PierreMarion23/single-location-regression"
Open Datasets | No | "Data generation. We use synthetically-generated data for this experiment. To create our train set, we generate sentences according to the patterns... The size of the datasets is given in the table below."
Dataset Splits | Yes | "The size of the datasets is given in the table below."

Name | Number of examples
Train set | 15552
Test set | 4608
Test w. OOD tokens | 3072
Test w. OOD structure | 144
Test w. OOD structure+tokens | 96
Hardware Specification | No | "All experiments run in a short time (less than one hour) on a standard laptop."
Software Dependencies | No | "We use the Transformers (Wolf et al., 2020) and scikit-learn (Pedregosa et al., 2011) libraries for the experiment of Section 2, and JAX (Bradbury et al., 2018) for the experiment of Section 5."
Experiment Setup | Yes | "We train using single-pass stochastic gradient descent (meaning that fresh samples are used at each step), for 8,000 steps with a batch size of 128 and a learning rate of 0.01. The experiment is repeated 20 times with independent random initializations, and 95% percentile intervals are plotted (but are not visible when the variance is too small). Parameters K, V, W1, W2 are initialized with Gaussian entries of variance 2/(din + dout). The bias terms are initialized to 0, as well as the query matrix Q... The output weights θ are initialized with Gaussian entries of variance 1/d²... Parameters are L = 10, d = p = 80, m = 200, ε² = 0.01, γ² = 0.5. The following table summarizes the value of the parameters in our experiments."

Name | Figure 4a | Figure 4b | Figure 5 | Figure 7a | Figure 7b
d | 400 | 400 | 80 | 400 | 400
L | 10 | 10 | 10 | 10 | 10
γ | 1/2 (all figures)
λt | 1/(1 + 10^4 t) | 0.1 | 2/(1 + 10^4 t) | 0.9 | 0.1
α | 4·10^-3 | 4·10^-3 | 10^-3 | 10^-3 | 4·10^-3
Number of steps | 120k | 20k | 200k | 120k | 20k
N. of repetitions | 30 | 30 | 30 | 30 | 30
Batch size | - | - | 5 | - | -
ε | - | - | 0.1 | - | -
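The quoted initialization scheme can be sketched as follows. This is an illustrative NumPy stand-in (the paper's own code uses JAX); the variance 2/(d_in + d_out) for K, V, W1, W2, the zero-initialized Q and biases, and the variance 1/d² for θ are taken from the quote, while the helper name and the exact matrix shapes are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(d_in, d_out):
    # Gaussian entries of variance 2/(d_in + d_out), as stated in the setup.
    std = np.sqrt(2.0 / (d_in + d_out))
    return rng.normal(0.0, std, size=(d_in, d_out))

d, m = 80, 200          # d = p = 80, MLP width m = 200 (Section 2 setup)

K = glorot_normal(d, d)   # key matrix
V = glorot_normal(d, d)   # value matrix
W1 = glorot_normal(d, m)  # first MLP layer
W2 = glorot_normal(m, d)  # second MLP layer
Q = np.zeros((d, d))      # query matrix initialized to 0, like the biases

# Output weights theta: Gaussian entries of variance 1/d^2, i.e. std 1/d.
theta = rng.normal(0.0, 1.0 / d, size=d)

# Single-pass SGD hyperparameters from the quote.
batch_size, learning_rate, n_steps = 128, 0.01, 8000
```

Single-pass SGD then draws a fresh batch of 128 synthetic samples at each of the 8,000 steps, so no training example is ever revisited.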