Systematic Outliers in Large Language Models
Authors: Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism's softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is available at https://github.com/an-yongqi/systematic-outliers. In this part, we introduce five different attention formulations to explore the role of systematic outliers. We train five GPT-2 (Radford et al., 2019) models with these attention variants. |
| Researcher Affiliation | Collaboration | 1. Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 3. Wuhan AI Research, Wuhan, China; 4. Objecteye Inc., Beijing, China |
| Pseudocode | No | The paper includes formulations for attention variants in Table 2, but these are mathematical formulations rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/an-yongqi/systematic-outliers. |
| Open Datasets | Yes | Using 100 sequences of length 2,048 from the RedPajama dataset (Computer, 2023), the positions where activation outliers first appear are identified. Both models were evaluated on the WikiText-2 dataset using perplexity (PPL) as the primary metric, comparing GPT-2 Default and GPT-2 with context-aware scaling factors. |
| Dataset Splits | No | The paper mentions using "100 sequences of length 2,048 from the RedPajama dataset" for analysis and evaluating on the "WikiText-2 dataset" for performance. It also states "We utilize the open-source GPT-2 implementation from the NanoGPT repository (Karpathy, 2023), following the default recommended training setup and optimizer settings." While this implies standard settings, the paper does not explicitly provide dataset split information (e.g., percentages or sample counts for train/validation/test sets). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "the open-source GPT-2 implementation from the NanoGPT repository (Karpathy, 2023)" but does not specify any software names with version numbers (e.g., PyTorch, Python, or CUDA versions). |
| Experiment Setup | Yes | Each of the five GPT-2 variants was trained for 50,000 iterations, processing approximately 2 billion tokens in total. For the attention-bias variant, we followed the initialization method proposed by Sun et al. (2024), drawing the learnable bias vectors k and v from N(0, 0.02I). |
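The attention-bias variant quoted in the table prepends a learnable key/value pair to the sequence, giving the softmax a dedicated slot to absorb surplus attention mass instead of forcing outlier activations. Below is a minimal single-head PyTorch sketch of that idea, not the authors' released code: the class name, the single-head layout, and reading N(0, 0.02I) as a Gaussian with standard deviation 0.02 (the GPT-2 default init) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasedAttention(nn.Module):
    """Single-head causal attention with a learnable (k, v) bias pair
    prepended to the sequence (sketch of the attention-bias variant)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Learnable bias key/value; interpreting N(0, 0.02I) as std=0.02
        # is an assumption (matches GPT-2's default initialization scale).
        self.k_bias = nn.Parameter(torch.empty(1, 1, dim))
        self.v_bias = nn.Parameter(torch.empty(1, 1, dim))
        nn.init.normal_(self.k_bias, std=0.02)
        nn.init.normal_(self.v_bias, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x)
        # Prepend the bias pair as a "virtual" token at position 0.
        k = torch.cat([self.k_bias.expand(b, 1, d), self.k_proj(x)], dim=1)
        v = torch.cat([self.v_bias.expand(b, 1, d), self.v_proj(x)], dim=1)
        scores = q @ k.transpose(-2, -1) / d**0.5  # (b, t, t+1)
        # Causal mask over real tokens; the bias slot (column 0) is
        # always visible, so softmax can dump attention mass onto it.
        mask = torch.ones(t, t + 1, dtype=torch.bool, device=x.device)
        mask = mask.tril(diagonal=1)
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v  # (b, t, d)
```

Because the bias slot participates in every row's softmax, attention weights on real tokens can sum to less than one, which is the "implicit context-aware scaling" role the paper attributes to systematic outliers.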