Quality over Quantity in Attention Layers: When Adding More Heads Hurts
Authors: Noah Amsel, Gilad Yehudai, Joan Bruna
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we prove that the rank can have a dramatic effect on the representational capacity of attention. This effect persists even when the number of heads and the parameter count are very large. Specifically, we present a simple and natural target function based on nearest neighbor search that can be represented using a single full-rank attention head for any sequence length. We prove that it cannot be approximated by a low-rank attention layer even on short sequences unless the number of heads is exponential in the embedding dimension. Thus, for this target function, rank is what determines an attention layer's power. We show that, for short sequences, using multiple layers allows the target to be approximated by low-rank attention; for long sequences, we conjecture that full-rank attention is necessary regardless of depth. Finally, we present experiments with standard multilayer transformers that validate our theoretical findings. They demonstrate that, because of how all standard transformer implementations set the rank, increasing the number of attention heads can severely decrease accuracy on certain tasks. |
| Researcher Affiliation | Academia | Noah Amsel¹, Gilad Yehudai¹ & Joan Bruna¹,² — ¹Courant Institute of Mathematical Sciences, New York University; ²Flatiron Institute |
| Pseudocode | No | The paper includes extensive theoretical derivations and proofs, particularly in Appendices C, D, and E, which describe mathematical constructions and algorithms. However, these are presented as mathematical text and logical steps within the prose, rather than in explicitly structured pseudocode or algorithm blocks (e.g., labeled 'Algorithm 1', 'Pseudocode'). |
| Open Source Code | Yes | Reproducibility Statement: Assumptions of all our theoretical results are described in the main text, and complete proofs are given in Appendices C to E. Details of all experiments are given in Appendix B, and our source code is included in the supplementary material and available at https://github.com/NoahAmsel/attention-formers. |
| Open Datasets | No | Our dataset is synthetic, so we train and test on a stream of freshly generated samples that never repeat. We train on 10^5 batches of size 256 each. For all experiments, we use AdamW with the same learning rate of 0.01 and a learning rate schedule of cosine annealing with a linear warm-up. |
| Dataset Splits | No | Our dataset is synthetic, so we train and test on a stream of freshly generated samples that never repeat. This implies dynamic data generation rather than predefined static splits. The paper does not specify fixed training, validation, or test splits for any dataset. |
| Hardware Specification | Yes | We run each experiment on a single Nvidia GPU (usually a V100) for no more than a few hours. |
| Software Dependencies | No | We use the PyTorch implementation of transformer encoders (Paszke et al., 2019) with two modifications. While PyTorch is mentioned, a specific version number is not provided, nor are other software dependencies with their versions. |
| Experiment Setup | Yes | We train on 10^5 batches of size 256 each. For all experiments, we use AdamW with the same learning rate of 0.01 and a learning rate schedule of cosine annealing with a linear warm-up. |
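The paper's target function is based on nearest neighbor search: a single full-rank softmax attention head can represent it, because softmax attention concentrates on the highest-scoring key as the inverse temperature grows. The sketch below is not the authors' code; the dimensions, random data, and temperature `beta` are illustrative assumptions used to show how one softmax head approximates the nearest-neighbor target.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's implementation):
# for a query q and key/value tokens x_1..x_N, the target returns the
# token whose inner product with q is largest (its "nearest neighbor").
rng = np.random.default_rng(0)
d, N = 8, 16
X = rng.standard_normal((N, d))   # sequence of tokens
q = rng.standard_normal(d)        # query token

target = X[np.argmax(X @ q)]      # exact nearest neighbor (inner-product sense)

# A single softmax attention head with a large inverse temperature beta
# (an assumed value here) puts nearly all weight on the argmax score.
beta = 50.0
scores = beta * (X @ q)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
approx = weights @ X              # attention output; near `target` for large beta
```

As `beta` increases, `weights` approaches a one-hot vector on the nearest neighbor, which is the sense in which one full-rank head represents this target for any sequence length.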
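The training setup quoted above (AdamW at learning rate 0.01 with cosine annealing and a linear warm-up, over 10^5 batches) can be sketched in PyTorch as follows. The warm-up length and the stand-in model are assumptions for illustration; the paper does not specify them here.

```python
import torch
import torch.nn as nn

# Hedged sketch of the reported optimizer/schedule configuration.
# The model and warm-up length are placeholders, not from the paper.
model = nn.Linear(64, 64)     # stand-in for the transformer encoder
total_steps = 100_000         # 10^5 batches (batch size 256)
warmup_steps = 1_000          # assumed warm-up length

optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        # linear warm-up from a small fraction of the base lr up to 0.01
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=1e-3, total_iters=warmup_steps),
        # cosine annealing over the remaining steps
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
```

In a training loop, `scheduler.step()` would be called once per batch after `optimizer.step()`.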