Quality over Quantity in Attention Layers: When Adding More Heads Hurts

Authors: Noah Amsel, Gilad Yehudai, Joan Bruna

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In this paper, we prove that the rank can have a dramatic effect on the representational capacity of attention. This effect persists even when the number of heads and the parameter count are very large. Specifically, we present a simple and natural target function based on nearest neighbor search that can be represented using a single full-rank attention head for any sequence length. We prove that it cannot be approximated by a low-rank attention layer even on short sequences unless the number of heads is exponential in the embedding dimension. Thus, for this target function, rank is what determines an attention layer's power. We show that, for short sequences, using multiple layers allows the target to be approximated by low-rank attention; for long sequences, we conjecture that full-rank attention is necessary regardless of depth. Finally, we present experiments with standard multilayer transformers that validate our theoretical findings. They demonstrate that, because of how all standard transformer implementations set the rank, increasing the number of attention heads can severely decrease accuracy on certain tasks."
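The nearest-neighbor target can be illustrated with a minimal sketch: softmax attention with a large inverse temperature approximately selects, for each query, the key with the largest inner product, i.e. the nearest key for unit-norm vectors. The toy setup below (the `beta` value, the choice of keys, and the perturbation) is our illustrative assumption, not the paper's exact construction.

```python
import numpy as np

def nn_attention(Q, K, beta=20.0):
    # Softmax attention with a large inverse temperature `beta`:
    # each query's output is (approximately) its nearest key.
    S = beta * (Q @ K.T)                         # similarity scores
    W = np.exp(S - S.max(axis=1, keepdims=True)) # stable softmax
    W /= W.sum(axis=1, keepdims=True)
    return W @ K

K = np.eye(4)                        # four well-separated unit-norm keys
Q = K + 0.1 * np.roll(K, 1, axis=1)  # queries = slightly perturbed keys
Q /= np.linalg.norm(Q, axis=1, keepdims=True)

out = nn_attention(Q, K)
print(np.abs(out - K).max() < 1e-3)  # -> True: each output snaps to its nearest key
```

With well-separated keys, the softmax weights concentrate almost entirely on the nearest key, so the attention output recovers the nearest-neighbor map; the paper's point is that this requires a full-rank head, while low-rank heads cannot reproduce it efficiently.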
Researcher Affiliation: Academia. Noah Amsel, Gilad Yehudai & Joan Bruna, Courant Institute of Mathematical Sciences, New York University; Bruna also at the Flatiron Institute.
Pseudocode: No. The paper includes extensive theoretical derivations and proofs, particularly in Appendices C, D, and E, which describe mathematical constructions and algorithms. However, these are presented as mathematical text and logical steps within the prose, rather than as explicitly structured pseudocode or algorithm blocks (e.g., a labeled "Algorithm 1").
Open Source Code: Yes. Reproducibility Statement: "Assumptions of all our theoretical results are described in the main text, and complete proofs are given in Appendices C to E. Details of all experiments are given in Appendix B, and our source code is included in the supplementary material and available at https://github.com/NoahAmsel/attention-formers."
Open Datasets: No. "Our dataset is synthetic, so we train and test on a stream of freshly generated samples that never repeat. We train on 10^5 batches of size 256 each. For all experiments, we use AdamW with the same learning rate of 0.01 and a learning rate schedule of cosine annealing with a linear warm-up."
Dataset Splits: No. "Our dataset is synthetic, so we train and test on a stream of freshly generated samples that never repeat." This implies dynamic data generation rather than predefined static splits; the paper does not specify fixed training, validation, or test splits for any dataset.
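A streaming synthetic dataset like this makes fixed splits unnecessary: every batch is freshly sampled, so held-out data is simply more draws from the same distribution. The sketch below is a generic illustration under our own assumptions (Gaussian samples, illustrative shapes); it is not the paper's actual task generator.

```python
import random

def fresh_batches(batch_size=256, seq_len=8, dim=4):
    # Endless stream of freshly generated synthetic samples; no sample
    # ever repeats, so no train/val/test split is needed.
    # (Shapes and the Gaussian distribution are illustrative assumptions.)
    while True:
        yield [[[random.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(seq_len)]
               for _ in range(batch_size)]

stream = fresh_batches(batch_size=2, seq_len=3, dim=4)
first = next(stream)
print(len(first), len(first[0]), len(first[0][0]))  # -> 2 3 4
```

Evaluation on such a stream measures true population accuracy rather than memorization, which is why the report marks "Dataset Splits" as not applicable in the conventional sense.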
Hardware Specification: Yes. "We run each experiment on a single Nvidia GPU (usually a V100) for no more than a few hours."
Software Dependencies: No. "We use the PyTorch implementation of transformer encoders (Paszke et al., 2019) with two modifications." While PyTorch is mentioned, a specific version number is not provided, nor are other software dependencies with their versions.
Experiment Setup: Yes. "We train on 10^5 batches of size 256 each. For all experiments, we use AdamW with the same learning rate of 0.01 and a learning rate schedule of cosine annealing with a linear warm-up."
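The stated schedule (linear warm-up into cosine annealing over 10^5 steps at a peak learning rate of 0.01) can be sketched as a pure function of the step index. The warm-up length below is our illustrative assumption; the paper does not state it.

```python
import math

def lr_at(step, total_steps=100_000, base_lr=0.01, warmup_steps=1_000):
    # Linear warm-up from 0 to base_lr, then cosine annealing to 0.
    # (warmup_steps=1_000 is an assumption; the paper gives no value.)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

print(round(lr_at(0), 6))        # -> 0.0   (start of warm-up)
print(round(lr_at(1_000), 6))    # -> 0.01  (peak after warm-up)
print(round(lr_at(100_000), 6))  # -> 0.0   (fully annealed)
```

In PyTorch this shape is typically assembled from `torch.optim.AdamW` plus a warm-up scheduler chained with `CosineAnnealingLR`; the closed form above is just the resulting curve.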