From Attention to Activation: Unraveling the Enigmas of Large Language Models

Authors: Prannay Kaul, Chengcheng Ma, Ismail Elezi, Jiankang Deng

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that uses orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but they also enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and the perplexity penalty under 4-bit weight quantisation from 3565 to 0.3. Code is available at https://github.com/prannaykaul/OrthoAdam.
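The softmax-1 reformulation mentioned in the abstract adds a constant 1 to the softmax denominator, so an attention head can push all probabilities towards zero (effectively "attending to nothing") instead of being forced to dump residual mass on the first token. A minimal NumPy sketch of that idea; the max-shift trick for numerical stability is our implementation assumption, not a detail taken from the paper:

```python
import numpy as np

def softmax(x):
    # standard softmax: exp(x_i) / sum_j exp(x_j); always sums to 1
    z = np.exp(x - x.max())
    return z / z.sum()

def softmax1(x):
    # softmax-1: exp(x_i) / (1 + sum_j exp(x_j))
    # the extra 1 in the denominator lets every output approach 0
    # when all logits are very negative
    m = max(x.max(), 0.0)              # shift for numerical stability
    z = np.exp(x - m)
    return z / (np.exp(-m) + z.sum())
```

For strongly negative logits, `softmax` still sums to 1 while `softmax1` returns values near zero, which is the escape hatch the paper argues attention heads need.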
Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab, London, UK; 2 Institute of Automation, Chinese Academy of Sciences (CASIA)
Pseudocode | Yes | Algorithm 1: OrthoAdam, our proposed optimiser for reducing activation outliers. g_t^2 is the element-wise square g_t ⊙ g_t. By β_1^t and β_2^t we mean β_1 and β_2 taken to the power of t.
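Algorithm 1 itself is not reproduced in this report; the following is a hedged single-parameter sketch of the stated idea only — run Adam's moment updates on gradients rotated by a fixed orthogonal matrix Q, then rotate the resulting update back. The class name, the dense random Q (via QR decomposition), and the hyperparameter defaults are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

class OrthoAdamSketch:
    """Sketch of the OrthoAdam idea: Adam run in a rotated basis.

    A fixed random orthogonal matrix Q transforms each gradient before
    the first/second moment accumulation; the final update is rotated
    back with Q.T. Because Q is orthogonal, Q.T @ Q = I.
    """

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
        rng = np.random.default_rng(seed)
        # QR decomposition of a Gaussian matrix gives a random orthogonal Q
        self.Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)   # first moment (in the rotated basis)
        self.v = np.zeros(dim)   # second moment (in the rotated basis)
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        g = self.Q @ grad                                    # rotate gradient
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g    # element-wise square g_t ⊙ g_t
        m_hat = self.m / (1 - self.b1 ** self.t)             # bias correction with β_1^t
        v_hat = self.v / (1 - self.b2 ** self.t)             # bias correction with β_2^t
        update = m_hat / (np.sqrt(v_hat) + self.eps)
        return param - self.lr * (self.Q.T @ update)         # rotate update back
```

With Q set to the identity this reduces exactly to standard Adam, which makes the rotation the only moving part of the sketch.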
Open Source Code | Yes | Code is available at https://github.com/prannaykaul/OrthoAdam.
Open Datasets | Yes | Datasets. We train all models on the en training split of the C4 dataset (Dodge et al., 2021; Raffel et al., 2020) and evaluate on 100000 samples from the validation en split.
Dataset Splits | Yes | Datasets. We train all models on the en training split of the C4 dataset (Dodge et al., 2021; Raffel et al., 2020) and evaluate on 100000 samples from the validation en split.
Hardware Specification | Yes | We train models on 8 NVIDIA 32GB V100 GPUs using the PyTorch deep-learning framework (Paszke et al., 2019) and the Hugging Face Transformers library (Wolf et al., 2020).
Software Dependencies | No | We train models on 8 NVIDIA 32GB V100 GPUs using the PyTorch deep-learning framework (Paszke et al., 2019) and the Hugging Face Transformers library (Wolf et al., 2020). Specific version numbers for PyTorch and the Hugging Face Transformers library are not provided in the text.
Experiment Setup | Yes | Training. Unless stated otherwise, we use a batch size of 512 and a cosine learning rate schedule with linear warmup for {1000, 2000, 6000, 10000} steps for models with {60M, 130M, 350M, 1.4B} parameters respectively, with a maximum learning rate of 10^-3. We train models with {60M, 130M, 350M, 1.4B} parameters for {160k, 320k, 960k, 600k} steps respectively.
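The schedule quoted above (linear warmup to a peak of 10^-3, then cosine decay) can be written as a simple step-to-learning-rate function. Decaying all the way to zero, rather than to a non-zero floor, is our assumption; the paper only names the schedule shape:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, max_lr=1e-3):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to 0.

    Example from the quoted setup: the 60M-parameter model uses
    warmup_steps=1000 and total_steps=160000.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

Halfway through warmup this gives half the peak rate, and at the final step the cosine term reaches -1, so the learning rate anneals to zero.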