When Attention Sink Emerges in Language Models: An Empirical View
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we first demonstrate that attention sinks exist universally in auto-regressive LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. ... We pre-train a series of LLaMA models to conduct our experiments, based on the repos (Zhang et al., 2024a; Liu et al., 2024a). |
| Researcher Affiliation | Collaboration | 1 Sea AI Lab, Singapore; 2 National University of Singapore. EMAIL; EMAIL |
| Pseudocode | No | The paper describes methods and processes in paragraph text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/sail-sg/Attention-Sink. |
| Open Datasets | Yes | For data distribution, we sample 5B tokens from the Pile dataset (Gao et al., 2020). ... We leverage the platform (Gao et al., 2024) to evaluate the performance of open-sourced LMs, including LLaMA2/LLaMA3/OPT/Pythia/GPT2 families, on downstream LM task, e.g., HellaSwag (Zellers et al., 2019). ... Specifically, we utilize the platform to conduct our experiments. The experimental configurations include: the UltraChat dataset (about 200k training samples) (Ding et al., 2023) |
| Dataset Splits | Yes | We use the Pile-CC validation loss (Gao et al., 2020; Liu et al., 2024a) to measure the model performance and sample 100 sequences with T = 64 (no BOS token) out of training data to measure the metric Sink^ε_k with ε = 0.3. ... The experimental configurations include: the UltraChat dataset (about 200k training samples) (Ding et al., 2023), full-model fine-tuning, a learning rate of 2e-5 with cosine scheduling, batch size of 64, each of which contains 2048 tokens, one training epoch. |
| Hardware Specification | No | The paper states: "We pre-train a series of LLaMA models to conduct our experiments, based on the repos (Zhang et al., 2024a; Liu et al., 2024a)." It does not provide specific details about the hardware used (e.g., GPU models, CPU types, or cloud instances). |
| Software Dependencies | No | The paper mentions "AdamW (Loshchilov & Hutter, 2017)" as an optimizer and various model components like "Rotary (Su et al., 2024)", "RMSNorm (Zhang & Sennrich, 2019)", and "SwiGLU activation (Shazeer, 2020)". It also references a "platform" for experiments (huggingface/alignment-handbook in footnote). However, it does not provide specific version numbers for any software libraries, frameworks, or operating systems used. |
| Experiment Setup | Yes | For data distribution, we sample 5B tokens from the Pile dataset (Gao et al., 2020). We set the context length to 2048 tokens, the batch size to 1M tokens, and the training step to 20k (including 100 steps for warm-up). We adopt a learning rate of 4e-4 with cosine scheduling. The optimizer is AdamW (Loshchilov & Hutter, 2017) with a weight decay ratio of 0.1. ... The experimental configurations include: the UltraChat dataset (about 200k training samples) (Ding et al., 2023), full-model fine-tuning, a learning rate of 2e-5 with cosine scheduling, batch size of 64, each of which contains 2048 tokens, one training epoch. |