Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
Authors: Tiberiu Mușat
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. [...] I uncover the learned mechanisms by studying the attention maps in the trained transformers. |
| Researcher Affiliation | Collaboration | Tiberiu Mușat, ETH Zürich, Switzerland; Giotto.ai, Switzerland; EMAIL |
| Pseudocode | No | The paper describes the methods and procedures in paragraph text and through diagrams (e.g., Figure 5, Figure 7) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to a code repository. The text describes the methodology and experiments, but does not offer concrete access to code. |
| Open Datasets | No | I test the large language models on 500 randomly generated questions for each formulation. [...] For each formulation, I train 64 transformers [...] 2^20 randomly generated training examples [...]. I train each transformer for 24k steps, a batch size of 128, and 262k randomly generated training examples (IC). |
| Dataset Splits | No | The paper mentions '2^20 randomly generated training examples' and '262k randomly generated training examples (IC)' for training and mentions 'validation loss', which implies a validation set, but it does not specify explicit percentages, sample counts, or the methodology used to split the data into training, validation, or test sets for reproducibility. For benchmarking LLMs, it states 'I test the large language models on 500 randomly generated questions for each formulation'. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma, 2014)' and 'layer normalization (Ba et al., 2016)', and following 'the recipe of Radford et al. (2019)', but does not provide specific version numbers for any programming languages, libraries, frameworks, or other software components. |
| Experiment Setup | Yes | For each formulation and number of layers, I train 8 transformers following the recipe of Radford et al. (2019). Each transformer has 8 attention heads per layer and residual streams of size 128. I train for 10k steps using the Adam optimizer (Kingma, 2014) with a learning rate of 10^-3, decoupled weight decay of 0.1 (Loshchilov, 2017), a batch size of 512, 2^20 randomly generated training examples, layer normalization (Ba et al., 2016), no dropout, and mean squared error loss. |
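The experiment-setup row pairs Adam with *decoupled* weight decay (Loshchilov, 2017), i.e. the AdamW variant, at the reported learning rate of 10^-3 and decay of 0.1. As a minimal sketch of what that update rule does — not the authors' actual implementation, whose code is not released — a single AdamW parameter step can be written in numpy; all function and variable names here are illustrative:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update: Adam moment estimates plus decoupled weight
    decay, using the lr=1e-3 and weight_decay=0.1 reported in the paper.
    t is the 1-based step count used for bias correction."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad           # first-moment EMA of the gradient
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay: subtracted directly from the weights, rather than
    # folded into the gradient as in classic L2 regularization.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

# One step from param=1.0 with grad=1.0 shrinks the weight slightly,
# by roughly lr * (1 + weight_decay * param).
p, m, v = adamw_step(np.array(1.0), np.array(1.0), m=0.0, v=0.0, t=1)
```

In a framework such as PyTorch the same settings would correspond to `torch.optim.AdamW(params, lr=1e-3, weight_decay=0.1)`; the point of the sketch is only to make the "decoupled" qualifier in the table concrete.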