Quality over Quantity in Attention Layers: When Adding More Heads Hurts

Authors: Noah Amsel, Gilad Yehudai, Joan Bruna

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In this paper, we prove that the rank can have a dramatic effect on the representational capacity of attention. This effect persists even when the number of heads and the parameter count are very large. Specifically, we present a simple and natural target function based on nearest neighbor search that can be represented using a single full-rank attention head for any sequence length. We prove that it cannot be approximated by a low-rank attention layer even on short sequences unless the number of heads is exponential in the embedding dimension. Thus, for this target function, rank is what determines an attention layer's power. We show that, for short sequences, using multiple layers allows the target to be approximated by low-rank attention; for long sequences, we conjecture that full-rank attention is necessary regardless of depth. Finally, we present experiments with standard multilayer transformers that validate our theoretical findings. They demonstrate that, because of how all standard transformer implementations set the rank, increasing the number of attention heads can severely decrease accuracy on certain tasks."
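The nearest-neighbor target can be illustrated with a minimal sketch: softmax attention with a large inverse temperature approximately selects, for each query, the key with the largest inner product, i.e. the nearest key for unit-norm vectors. The toy setup below (the `beta` value, the choice of keys, and the perturbation) is our illustrative assumption, not the paper's exact construction.

```python
import numpy as np

def nn_attention(Q, K, beta=20.0):
    # Softmax attention with a large inverse temperature `beta`:
    # each query's output is (approximately) its nearest key.
    S = beta * (Q @ K.T)                         # similarity scores
    W = np.exp(S - S.max(axis=1, keepdims=True)) # stable softmax
    W /= W.sum(axis=1, keepdims=True)
    return W @ K

K = np.eye(4)                        # four well-separated unit-norm keys
Q = K + 0.1 * np.roll(K, 1, axis=1)  # queries = slightly perturbed keys
Q /= np.linalg.norm(Q, axis=1, keepdims=True)

out = nn_attention(Q, K)
print(np.abs(out - K).max() < 1e-3)  # -> True: each output snaps to its nearest key
```

With well-separated keys, the softmax weights concentrate almost entirely on the nearest key, so the attention output recovers the nearest-neighbor map; the paper's point is that this requires a full-rank head, while low-rank heads cannot reproduce it efficiently.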
Researcher Affiliation: Academia. Noah Amsel, Gilad Yehudai & Joan Bruna, Courant Institute of Mathematical Sciences, New York University; Bruna also at the Flatiron Institute.
Pseudocode: No. The paper includes extensive theoretical derivations and proofs, particularly in Appendices C, D, and E, which describe mathematical constructions and algorithms. However, these are presented as mathematical text and logical steps within the prose, rather than as explicitly structured pseudocode or algorithm blocks (e.g., a labeled "Algorithm 1").
Open Source Code: Yes. Reproducibility Statement: "Assumptions of all our theoretical results are described in the main text, and complete proofs are given in Appendices C to E. Details of all experiments are given in Appendix B, and our source code is included in the supplementary material and available at https://github.com/NoahAmsel/attention-formers."
Open Datasets: No. "Our dataset is synthetic, so we train and test on a stream of freshly generated samples that never repeat. We train on 10^5 batches of size 256 each. For all experiments, we use AdamW with the same learning rate of 0.01 and a learning rate schedule of cosine annealing with a linear warm-up."
Dataset Splits: No. "Our dataset is synthetic, so we train and test on a stream of freshly generated samples that never repeat." This implies dynamic data generation rather than predefined static splits; the paper does not specify fixed training, validation, or test splits for any dataset.
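A streaming synthetic dataset like this makes fixed splits unnecessary: every batch is freshly sampled, so held-out data is simply more draws from the same distribution. The sketch below is a generic illustration under our own assumptions (Gaussian samples, illustrative shapes); it is not the paper's actual task generator.

```python
import random

def fresh_batches(batch_size=256, seq_len=8, dim=4):
    # Endless stream of freshly generated synthetic samples; no sample
    # ever repeats, so no train/val/test split is needed.
    # (Shapes and the Gaussian distribution are illustrative assumptions.)
    while True:
        yield [[[random.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(seq_len)]
               for _ in range(batch_size)]

stream = fresh_batches(batch_size=2, seq_len=3, dim=4)
first = next(stream)
print(len(first), len(first[0]), len(first[0][0]))  # -> 2 3 4
```

Evaluation on such a stream measures true population accuracy rather than memorization, which is why the report marks "Dataset Splits" as not applicable in the conventional sense.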
Hardware Specification: Yes. "We run each experiment on a single Nvidia GPU (usually a V100) for no more than a few hours."
Software Dependencies: No. "We use the PyTorch implementation of transformer encoders (Paszke et al., 2019) with two modifications." While PyTorch is mentioned, a specific version number is not provided, nor are other software dependencies with their versions.
Experiment Setup: Yes. "We train on 10^5 batches of size 256 each. For all experiments, we use AdamW with the same learning rate of 0.01 and a learning rate schedule of cosine annealing with a linear warm-up."
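The stated schedule (linear warm-up into cosine annealing over 10^5 steps at a peak learning rate of 0.01) can be sketched as a pure function of the step index. The warm-up length below is our illustrative assumption; the paper does not state it.

```python
import math

def lr_at(step, total_steps=100_000, base_lr=0.01, warmup_steps=1_000):
    # Linear warm-up from 0 to base_lr, then cosine annealing to 0.
    # (warmup_steps=1_000 is an assumption; the paper gives no value.)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

print(round(lr_at(0), 6))        # -> 0.0   (start of warm-up)
print(round(lr_at(1_000), 6))    # -> 0.01  (peak after warm-up)
print(round(lr_at(100_000), 6))  # -> 0.0   (fully annealed)
```

In PyTorch this shape is typically assembled from `torch.optim.AdamW` plus a warm-up scheduler chained with `CosineAnnealingLR`; the closed form above is just the resulting curve.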