Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Authors: Tiberiu Mușat

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. [...] I uncover the learned mechanisms by studying the attention maps in the trained transformers."
Researcher Affiliation | Collaboration | Tiberiu Mușat, ETH Zürich, Switzerland; Giotto.ai, Switzerland.
Pseudocode | No | The paper describes its methods and procedures in paragraph text and diagrams (e.g., Figure 5, Figure 7) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository; it describes the methodology and experiments but offers no concrete access to code.
Open Datasets | No | "I test the large language models on 500 randomly generated questions for each formulation. [...] For each formulation, I train 64 transformers [...] 2^20 randomly generated training examples [...]. I train each transformer for 24k steps, a batch size of 128, and 262k randomly generated training examples (IC)."
Dataset Splits | No | The paper mentions '2^20 randomly generated training examples' and '262k randomly generated training examples (IC)' for training, and mentions 'validation loss', which implies a validation set, but it does not specify explicit percentages, sample counts, or the methodology used to split the data into training, validation, and test sets. For benchmarking LLMs, it states: 'I test the large language models on 500 randomly generated questions for each formulation.'
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory specifications) used to run the experiments.
Software Dependencies | No | The paper mentions the 'Adam optimizer (Kingma, 2014)' and 'layer normalization (Ba et al., 2016)', and follows 'the recipe of Radford et al. (2019)', but it provides no version numbers for any programming languages, libraries, frameworks, or other software components.
Experiment Setup | Yes | "For each formulation and number of layers, I train 8 transformers following the recipe of Radford et al. (2019). Each transformer has 8 attention heads per layer and residual streams of size 128. I train for 10k steps using the Adam optimizer (Kingma, 2014) with a learning rate of 10^-3, decoupled weight decay of 0.1 (Loshchilov, 2017), a batch size of 512, 2^20 randomly generated training examples, layer normalization (Ba et al., 2016), no dropout, and mean squared error loss."
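The reported hyperparameters can be collected into a minimal configuration sketch. This is an illustration only: the field names are mine, not the authors', and it assumes the training-set size reads 2^20 randomly generated examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (names are illustrative)."""
    heads_per_layer: int = 8
    d_model: int = 128                  # residual stream size
    train_steps: int = 10_000
    learning_rate: float = 1e-3         # Adam (Kingma, 2014)
    weight_decay: float = 0.1           # decoupled weight decay (Loshchilov, 2017)
    batch_size: int = 512
    num_train_examples: int = 2 ** 20   # randomly generated (assumed reading)
    dropout: float = 0.0
    loss: str = "mse"                   # mean squared error


cfg = TrainConfig()
# batch_size * train_steps = 5.12M samples drawn, so training makes
# roughly 5 passes over the ~1.05M-example training set.
epochs = cfg.batch_size * cfg.train_steps / cfg.num_train_examples
print(f"approx. epochs: {epochs:.2f}")  # prints "approx. epochs: 4.88"
```

Under this reading the numbers are internally consistent (the batch size of 512 would otherwise exceed a literal dataset of 220 examples), which is why the superscript interpretation is adopted.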