When Attention Sink Emerges in Language Models: An Empirical View
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we first demonstrate that attention sinks exist universally in auto-regressive LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. ... We pre-train a series of LLaMA models to conduct our experiments, based on the repos (Zhang et al., 2024a; Liu et al., 2024a). |
| Researcher Affiliation | Collaboration | 1 Sea AI Lab, Singapore; 2 National University of Singapore. EMAIL; EMAIL |
| Pseudocode | No | The paper describes methods and processes in paragraph text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/sail-sg/Attention-Sink. |
| Open Datasets | Yes | For data distribution, we sample 5B tokens from the Pile dataset (Gao et al., 2020). ... We leverage the platform (Gao et al., 2024) to evaluate the performance of open-sourced LMs, including LLaMA2/LLaMA3/OPT/Pythia/GPT2 families, on downstream LM task, e.g., HellaSwag (Zellers et al., 2019). ... Specifically, we utilize the platform to conduct our experiments. The experimental configurations include: the UltraChat dataset (about 200k training samples) (Ding et al., 2023) |
| Dataset Splits | Yes | We use the Pile-CC validation loss (Gao et al., 2020; Liu et al., 2024a) to measure the model performance and sample 100 sequences with T = 64 (no BOS token) out of training data to measure the metric Sink^ε_k with ε = 0.3. ... The experimental configurations include: the UltraChat dataset (about 200k training samples) (Ding et al., 2023), full-model fine-tuning, a learning rate of 2e-5 with cosine scheduling, batch size of 64, each of which contains 2048 tokens, one training epoch. |
| Hardware Specification | No | The paper states: "We pre-train a series of LLaMA models to conduct our experiments, based on the repos (Zhang et al., 2024a; Liu et al., 2024a)." It does not provide specific details about the hardware used (e.g., GPU models, CPU types, or cloud instances). |
| Software Dependencies | No | The paper mentions "AdamW (Loshchilov & Hutter, 2017)" as an optimizer and various model components like "Rotary (Su et al., 2024)", "RMSNorm (Zhang & Sennrich, 2019)", and "SwiGLU activation (Shazeer, 2020)". It also references a "platform" for experiments (huggingface/alignment-handbook in footnote). However, it does not provide specific version numbers for any software libraries, frameworks, or operating systems used. |
| Experiment Setup | Yes | For data distribution, we sample 5B tokens from the Pile dataset (Gao et al., 2020). We set the context length to 2048 tokens, the batch size to 1M tokens, and the training step to 20k (including 100 steps for warm-up). We adopt a learning rate of 4e-4 with cosine scheduling. The optimizer is AdamW (Loshchilov & Hutter, 2017) with a weight decay ratio of 0.1. ... The experimental configurations include: the UltraChat dataset (about 200k training samples) (Ding et al., 2023), full-model fine-tuning, a learning rate of 2e-5 with cosine scheduling, batch size of 64, each of which contains 2048 tokens, one training epoch. |