Consensus Is All You Get: The Role of Attention in Transformers

Authors: Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 and the GPT-Neo models. (Section 5, Simulations and empirical validation:) In this section we illustrate the theoretical results and show that their conclusions appear to hold even when our assumptions are violated. We start by simulating the continuous transformer model and illustrating our theoretical results. In addition to simulations, we provide empirical evidence using the GPT-2 and the GPT-Neo models to show how token consensus seems to occur even if the assumptions in our theoretical results are not satisfied.
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, University of California, Los Angeles, USA. Correspondence to: João Pedro Silvestre <EMAIL>.
Pseudocode | No | The paper focuses on mathematical models and theoretical proofs, complemented by simulations and empirical validation. It does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | In the first experiments, we removed the feedforward layers of each model to make them closer to the structure we assume in our theoretical work. The experiments were then repeated without removing the feedforward layers, showing that in both cases convergence to consensus occurs. (...) the code used for our experiments is available at (Silvestre, 2025).
Open Datasets | Yes | Our theoretical findings are validated by simulations of the mathematical model for attention. Moreover, experiments with the GPT-2 and the GPT-Neo models provide empirical evidence for convergence to consensus equilibria in more general situations than those captured by our theoretical results, thus providing additional confirmation for model collapse. (...) experiments conducted on the GPT-2 XL model and the GPT-Neo 2.7B model (...) we used the same set of 100 random prompts, each generated by uniformly sampling 200 tokens from the GPT-2 tokenizer's vocabulary.
Dataset Splits | No | The paper describes generating 100 random prompts by sampling tokens from the GPT-2 tokenizer's vocabulary. It does not specify any training/test/validation splits for these generated prompts or any other dataset.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its simulations or experiments with the GPT-2 and GPT-Neo models.
Software Dependencies | No | The paper mentions using the "Hugging Face library (Wolf et al., 2020)" but does not provide specific version numbers for this or any other software dependency.
Experiment Setup | Yes | We simulate the motion of 10 tokens, each of them randomly placed on the sphere S^2 ⊂ R^3, according to the dynamics (6) with h = 2. All matrices, except for P1(t) and P2(t), were randomly chosen, and each element was drawn from a uniform distribution on the interval [−0.5, 0.5]. (...) The matrices D1(t) and D2(t) were given by D1(t) = 2 diag(cos(10πt), sin(10πt), cos(6πt)) and D2(t) = 2 diag(cos(6πt), sin(6πt), cos(4πt)). (...) We now consider the auto-regressive model with 50 tokens on S^499 ⊂ R^500. The number and dimension of the tokens were chosen to make them comparable to the GPT-2 model. We use two heads (h = 2). (...) For all our experiments, we used the same set of 100 random prompts, each generated by uniformly sampling 200 tokens from the GPT-2 tokenizer's vocabulary. In each experiment, we plot the average of E across all the prompts.
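The token-on-a-sphere simulation described in the Experiment Setup row can be sketched in a few lines. The sketch below is a deliberate simplification, not the paper's exact dynamics (6): it uses a single softmax self-attention head with identity query/key/value matrices (the paper uses two heads with randomly chosen, time-varying matrices), a plain Euler step with renormalization, and tokens initialized in a hemisphere, a regime where contraction to a consensus point is easy to observe.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10, 3          # 10 tokens on the sphere S^2 in R^3, as in the paper
dt, steps = 0.05, 1000
beta = 1.0            # softmax inverse temperature (an assumption of this sketch)

# Initialize tokens on the upper hemisphere of S^2.
X = rng.normal(size=(n, d))
X[:, 2] = np.abs(X[:, 2])
X /= np.linalg.norm(X, axis=1, keepdims=True)

def spread(X):
    # Largest pairwise geodesic angle between tokens: 0 means consensus.
    G = np.clip(X @ X.T, -1.0, 1.0)
    return float(np.arccos(G).max())

initial = spread(X)
for _ in range(steps):
    A = np.exp(beta * (X @ X.T))        # unnormalized attention scores
    A /= A.sum(axis=1, keepdims=True)   # row-wise softmax over tokens
    V = A @ X                           # attention-weighted averages
    X = X + dt * (V - X)                # Euler step of the attention dynamics
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # project back onto the sphere

final = spread(X)
```

Tracking `spread(X)` over time plays the role of the consensus measure: under these simplified dynamics it shrinks toward zero as all tokens collapse onto a single point of the sphere.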
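The random-prompt construction quoted in the Open Datasets and Experiment Setup rows (100 prompts of 200 uniformly sampled token ids) can be reproduced with the standard library alone. A minimal sketch, assuming the GPT-2 vocabulary size of 50257; the paper samples from the actual GPT-2 tokenizer's vocabulary via Hugging Face:

```python
import random

random.seed(0)

VOCAB_SIZE = 50257        # GPT-2's vocabulary size (assumed here, no tokenizer loaded)
N_PROMPTS, PROMPT_LEN = 100, 200

# Each prompt is a list of 200 token ids drawn uniformly from the vocabulary,
# mirroring the paper's random-prompt construction.
prompts = [
    [random.randrange(VOCAB_SIZE) for _ in range(PROMPT_LEN)]
    for _ in range(N_PROMPTS)
]
```

In the paper's pipeline these id sequences would then be fed to GPT-2 XL and GPT-Neo 2.7B, and the consensus measure E averaged across all 100 prompts.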
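The feedforward-removal ablation mentioned in the Open Source Code row can be illustrated with a toy stand-in. The `Block` class below is a hypothetical pre-residual transformer block, not the paper's or Hugging Face's code: replacing each MLP with a zero map leaves only the attention path, which is the attention-only structure the theory assumes.

```python
class Block:
    """Toy transformer block: attention and feedforward sublayers, each residual."""
    def __init__(self, attn, mlp):
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        x = x + self.attn(x)      # residual attention sublayer
        return x + self.mlp(x)    # residual feedforward sublayer

def remove_feedforward(blocks):
    # Swap every MLP for a zero map, so each block reduces to attention only.
    for b in blocks:
        b.mlp = lambda x: 0.0
    return blocks
```

With real GPT-2 weights the analogous step would patch each layer's MLP inside the Hugging Face model before running the prompts, then repeat the experiment with the MLPs intact, as the quoted excerpt describes.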