Consensus Is All You Get: The Role of Attention in Transformers

Authors: Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 and the GPT-Neo models. (Section 5, Simulations and empirical validation:) In this section we illustrate the theoretical results and show that their conclusions appear to hold even when our assumptions are violated. We start by simulating the continuous transformer model and illustrating our theoretical results. In addition to simulations, we provide empirical evidence using the GPT-2 and the GPT-Neo models to show how token consensus seems to occur even if the assumptions in our theoretical results are not satisfied.
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, University of California, Los Angeles, USA. Correspondence to: João Pedro Silvestre <EMAIL>.
Pseudocode | No | The paper focuses on mathematical models and theoretical proofs, complemented by simulations and empirical validation. It does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | In the first experiments, we removed the feedforward layers of each model to make them closer to the structure we assume in our theoretical work. The experiments were then repeated without removing the feedforward layers, showing that in both cases convergence to consensus occurs. (...) the code used for our experiments is available at (Silvestre, 2025).
Open Datasets | Yes | Our theoretical findings are validated by simulations of the mathematical model for attention. Moreover, experiments with the GPT-2 and the GPT-Neo models provide empirical evidence for convergence to consensus equilibria in more general situations than those captured by our theoretical results, thus providing additional confirmation for model collapse. (...) experiments conducted on the GPT-2 XL model and the GPT-Neo 2.7B model (...) we used the same set of 100 random prompts, each generated by uniformly sampling 200 tokens from the GPT-2 tokenizer's vocabulary.
Dataset Splits | No | The paper describes generating 100 random prompts by sampling tokens from the GPT-2 tokenizer's vocabulary. It does not specify any training/test/validation splits for these generated prompts or any other dataset.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its simulations or experiments with the GPT-2 and GPT-Neo models.
Software Dependencies | No | The paper mentions using the "Hugging Face library (Wolf et al., 2020)" but does not provide specific version numbers for this or any other software dependency.
Experiment Setup | Yes | We simulate the motion of 10 tokens, each of them randomly placed on the sphere S^2 ⊂ R^3, according to the dynamics (6) with h = 2. All matrices, except for P1(t) and P2(t), were randomly chosen, and each element was drawn from a uniform distribution on the interval [−0.5, 0.5]. (...) The matrices D1(t) and D2(t) were given by D1(t) = 2 diag(cos(10πt), sin(10πt), cos(6πt)) and D2(t) = 2 diag(cos(6πt), sin(6πt), cos(4πt)). (...) We now consider the auto-regressive model with 50 tokens on S^499 ⊂ R^500. The number and dimension of the tokens were chosen to make them comparable to the GPT-2 model. We use two heads (h = 2). (...) For all our experiments, we used the same set of 100 random prompts, each generated by uniformly sampling 200 tokens from the GPT-2 tokenizer's vocabulary. In each experiment, we plot the average of E across all the prompts.
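The token-on-a-sphere simulation described in the Experiment Setup row can be sketched in a few lines. The sketch below is a deliberate simplification, not the paper's exact dynamics (6): it uses a single softmax self-attention head with identity query/key/value matrices (the paper uses two heads with randomly chosen, time-varying matrices), a plain Euler step with renormalization, and tokens initialized in a hemisphere, a regime where contraction to a consensus point is easy to observe.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10, 3          # 10 tokens on the sphere S^2 in R^3, as in the paper
dt, steps = 0.05, 1000
beta = 1.0            # softmax inverse temperature (an assumption of this sketch)

# Initialize tokens on the upper hemisphere of S^2.
X = rng.normal(size=(n, d))
X[:, 2] = np.abs(X[:, 2])
X /= np.linalg.norm(X, axis=1, keepdims=True)

def spread(X):
    # Largest pairwise geodesic angle between tokens: 0 means consensus.
    G = np.clip(X @ X.T, -1.0, 1.0)
    return float(np.arccos(G).max())

initial = spread(X)
for _ in range(steps):
    A = np.exp(beta * (X @ X.T))        # unnormalized attention scores
    A /= A.sum(axis=1, keepdims=True)   # row-wise softmax over tokens
    V = A @ X                           # attention-weighted averages
    X = X + dt * (V - X)                # Euler step of the attention dynamics
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # project back onto the sphere

final = spread(X)
```

Tracking `spread(X)` over time plays the role of the consensus measure: under these simplified dynamics it shrinks toward zero as all tokens collapse onto a single point of the sphere.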
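The random-prompt construction quoted in the Open Datasets and Experiment Setup rows (100 prompts of 200 uniformly sampled token ids) can be reproduced with the standard library alone. A minimal sketch, assuming the GPT-2 vocabulary size of 50257; the paper samples from the actual GPT-2 tokenizer's vocabulary via Hugging Face:

```python
import random

random.seed(0)

VOCAB_SIZE = 50257        # GPT-2's vocabulary size (assumed here, no tokenizer loaded)
N_PROMPTS, PROMPT_LEN = 100, 200

# Each prompt is a list of 200 token ids drawn uniformly from the vocabulary,
# mirroring the paper's random-prompt construction.
prompts = [
    [random.randrange(VOCAB_SIZE) for _ in range(PROMPT_LEN)]
    for _ in range(N_PROMPTS)
]
```

In the paper's pipeline these id sequences would then be fed to GPT-2 XL and GPT-Neo 2.7B, and the consensus measure E averaged across all 100 prompts.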
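The feedforward-removal ablation mentioned in the Open Source Code row can be illustrated with a toy stand-in. The `Block` class below is a hypothetical pre-residual transformer block, not the paper's or Hugging Face's code: replacing each MLP with a zero map leaves only the attention path, which is the attention-only structure the theory assumes.

```python
class Block:
    """Toy transformer block: attention and feedforward sublayers, each residual."""
    def __init__(self, attn, mlp):
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        x = x + self.attn(x)      # residual attention sublayer
        return x + self.mlp(x)    # residual feedforward sublayer

def remove_feedforward(blocks):
    # Swap every MLP for a zero map, so each block reduces to attention only.
    for b in blocks:
        b.mlp = lambda x: 0.0
    return blocks
```

With real GPT-2 weights the analogous step would patch each layer's MLP inside the Hugging Face model before running the prompts, then repeat the experiment with the MLPs intact, as the quoted excerpt describes.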