Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Authors: Tiberiu Mușat

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. [...] I uncover the learned mechanisms by studying the attention maps in the trained transformers."
Researcher Affiliation | Collaboration | Tiberiu Mușat, ETH Zürich, Switzerland; Giotto.ai, Switzerland.
Pseudocode | No | The paper describes its methods and procedures in paragraph text and diagrams (e.g., Figure 5, Figure 7) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository; it describes the methodology and experiments but offers no concrete access to code.
Open Datasets | No | "I test the large language models on 500 randomly generated questions for each formulation. [...] For each formulation, I train 64 transformers [...] 2^20 randomly generated training examples [...]. I train each transformer for 24k steps, a batch size of 128, and 262k randomly generated training examples (IC)."
Dataset Splits | No | The paper mentions '2^20 randomly generated training examples' and '262k randomly generated training examples (IC)' for training, and mentions 'validation loss', which implies a validation set, but it does not specify explicit percentages, sample counts, or the methodology used to split the data into training, validation, and test sets. For benchmarking LLMs, it states: 'I test the large language models on 500 randomly generated questions for each formulation.'
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory specifications) used to run the experiments.
Software Dependencies | No | The paper mentions the 'Adam optimizer (Kingma, 2014)' and 'layer normalization (Ba et al., 2016)', and follows 'the recipe of Radford et al. (2019)', but it provides no version numbers for any programming languages, libraries, frameworks, or other software components.
Experiment Setup | Yes | "For each formulation and number of layers, I train 8 transformers following the recipe of Radford et al. (2019). Each transformer has 8 attention heads per layer and residual streams of size 128. I train for 10k steps using the Adam optimizer (Kingma, 2014) with a learning rate of 10^-3, decoupled weight decay of 0.1 (Loshchilov, 2017), a batch size of 512, 2^20 randomly generated training examples, layer normalization (Ba et al., 2016), no dropout, and mean squared error loss."
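The reported hyperparameters can be collected into a minimal configuration sketch. This is an illustration only: the field names are mine, not the authors', and it assumes the training-set size reads 2^20 randomly generated examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (names are illustrative)."""
    heads_per_layer: int = 8
    d_model: int = 128                  # residual stream size
    train_steps: int = 10_000
    learning_rate: float = 1e-3         # Adam (Kingma, 2014)
    weight_decay: float = 0.1           # decoupled weight decay (Loshchilov, 2017)
    batch_size: int = 512
    num_train_examples: int = 2 ** 20   # randomly generated (assumed reading)
    dropout: float = 0.0
    loss: str = "mse"                   # mean squared error


cfg = TrainConfig()
# batch_size * train_steps = 5.12M samples drawn, so training makes
# roughly 5 passes over the ~1.05M-example training set.
epochs = cfg.batch_size * cfg.train_steps / cfg.num_train_examples
print(f"approx. epochs: {epochs:.2f}")  # prints "approx. epochs: 4.88"
```

Under this reading the numbers are internally consistent (the batch size of 512 would otherwise exceed a literal dataset of 220 examples), which is why the superscript interpretation is adopted.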