Selective Attention Improves Transformer

Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Selective attention consistently improves language modeling and downstream task performance across a variety of model sizes and context lengths. For example, transformers trained with the language modeling objective on C4 with selective attention perform language modeling equivalently to standard transformers with 2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, than those without it, at the same validation perplexity. "In all of our experiments we use a decoder-only transformer with multi-head attention... We trained our models with the AdamW optimizer... For the language modeling experiments, we used the C4 (Raffel et al., 2023) dataset..."
Researcher Affiliation | Industry | Yaniv Leviathan (Google Research), Matan Kalman (Google Research), Yossi Matias (Google Research)
Pseudocode | Yes | "Figure 2 illustrates a sketch implementation. ... Figure 2: A sketch implementation of selective attention. The colored lines are the additions to standard attention."
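The Figure 2 sketch is not reproduced here, but the mechanism it illustrates can be approximated as follows: one head's pre-softmax logits double as selection scores, which are accumulated over past tokens and subtracted from every head's logits before the softmax. This is a minimal numpy sketch based on the paper's description, not the authors' code; the ReLU, the no-self-masking rule, and the one-step shift of the accumulated scores are assumptions about the exact details.

```python
import numpy as np

def selective_attention(q, k, v):
    """Hedged sketch of selective attention for a single causal layer.
    q, k, v: arrays of shape (heads, n, d). Head 0 doubles as the
    selection head (assumption about which head is used)."""
    h, n, d = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (h, n, n)
    future = np.triu(np.ones((n, n), dtype=bool), 1)      # causal mask
    logits = np.where(future, -1e9, logits)

    # Selection scores: S[i, j] = how much token i masks token j
    # for later queries. ReLU keeps only positive selections.
    S = np.maximum(logits[0], 0.0)
    np.fill_diagonal(S, 0.0)                              # no self-masking

    # Accumulate selections over past tokens, shifted by one so a
    # token's own selection only affects strictly later queries.
    F = np.cumsum(S, axis=0)
    F = np.vstack([np.zeros((1, n)), F[:-1]])

    # Subtract the accumulated mask from all heads' logits (the
    # colored-line additions in the paper's Figure 2 sketch).
    logits = np.where(future, -1e9, logits - F[None, :, :])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v                                          # (h, n, d)
```

Subtracting `F` before the softmax lets earlier tokens progressively down-weight keys they deem irrelevant, which is what makes pruning the context buffer possible at inference time.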
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | "For the language modeling experiments, we used the C4 (Raffel et al., 2023) dataset... We repeated some of the experiments with a vocabulary of size 32K and observed similar results. We also ran experiments with WikiText (Merity et al., 2016) and lm1b (Chelba et al., 2014) and observed similar results."
Dataset Splits | No | The paper mentions training, validation, and test sets, for example, "optimized the per-layer budgets on a training set and reported results on a separate unseen test set," but does not specify exact percentages or counts for these splits, nor reference a predefined split with a citation for the C4 dataset.
Hardware Specification | Yes | "We trained all of our models on TPUv4s."
Software Dependencies | No | The paper mentions specific components such as the AdamW optimizer and the SentencePiece tokenizer, but does not provide version numbers for these or for any other software dependencies such as deep learning frameworks or programming languages.
Experiment Setup | Yes | "We trained our models with the AdamW optimizer with β1 = 0.9 and β2 = 0.999 for a total of 524,288 steps. We used cosine decay and 1,000 linear warmup steps and a learning rate of 0.005. We repeated some of the experiments with different learning rates and obtained similar results. We used a batch size of 256 and a fixed context size of 512 for all training runs except for the context size experiments (Figure 3, left), where we used a batch size of 128."
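The reported schedule (1,000 linear warmup steps, cosine decay over 524,288 total steps, peak learning rate 0.005) can be written out as a small helper. This is a sketch of a standard warmup-plus-cosine schedule consistent with the quoted numbers; the decay floor of zero is an assumption, since the paper does not state a final learning rate.

```python
import math

TOTAL_STEPS = 524_288   # total training steps (reported)
WARMUP_STEPS = 1_000    # linear warmup steps (reported)
PEAK_LR = 0.005         # peak learning rate (reported)

def learning_rate(step):
    """Learning rate at a given step: linear warmup to PEAK_LR,
    then cosine decay to zero (zero floor is an assumption)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

In practice this function would be passed to an AdamW optimizer (β1 = 0.9, β2 = 0.999, as reported) as a per-step schedule.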