Systematic Outliers in Large Language Models
Authors: Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism's softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is available at https://github.com/an-yongqi/systematic-outliers. In this part, we introduce five different attention formulations to explore the role of systematic outliers. We train five GPT-2 (Radford et al., 2019) models with these attention variants. |
| Researcher Affiliation | Collaboration | 1. Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 3. Wuhan AI Research, Wuhan, China; 4. Objecteye Inc., Beijing, China |
| Pseudocode | No | The paper includes formulations for attention variants in Table 2, but these are mathematical formulations rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/an-yongqi/systematic-outliers. |
| Open Datasets | Yes | Using 100 sequences of length 2,048 from the RedPajama dataset (Computer, 2023), the positions where activation outliers first appear are identified. Both models were evaluated on the WikiText-2 dataset using perplexity (PPL) as the primary metric, comparing GPT-2 Default and GPT-2 with context-aware scaling factors. |
| Dataset Splits | No | The paper mentions using "100 sequences of length 2,048 from the RedPajama dataset" for analysis and evaluating on the "WikiText-2 dataset" for performance. It also states "We utilize the open-source GPT-2 implementation from the NanoGPT repository (Karpathy, 2023), following the default recommended training setup and optimizer settings." While this implies standard settings, the paper does not explicitly provide dataset split information (e.g., percentages or sample counts for train/validation/test sets). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "the open-source GPT-2 implementation from the NanoGPT repository (Karpathy, 2023)" but does not specify any software names with version numbers (e.g., PyTorch, Python, or CUDA versions). |
| Experiment Setup | Yes | Each of the five GPT-2 variants was trained for 50,000 iterations, processing approximately 2 billion tokens in total. For the attention-bias variant, we followed the initialization method proposed by Sun et al. (2024), drawing the learnable bias vectors k and v from N(0, 0.02I). |
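The attention-bias variant quoted in the table prepends a learnable key/value pair to the sequence, giving the softmax a dedicated slot to absorb surplus attention mass instead of forcing outlier activations. Below is a minimal single-head PyTorch sketch of that idea, not the authors' released code: the class name, the single-head layout, and reading N(0, 0.02I) as a Gaussian with standard deviation 0.02 (the GPT-2 default init) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasedAttention(nn.Module):
    """Single-head causal attention with a learnable (k, v) bias pair
    prepended to the sequence (sketch of the attention-bias variant)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Learnable bias key/value; interpreting N(0, 0.02I) as std=0.02
        # is an assumption (matches GPT-2's default initialization scale).
        self.k_bias = nn.Parameter(torch.empty(1, 1, dim))
        self.v_bias = nn.Parameter(torch.empty(1, 1, dim))
        nn.init.normal_(self.k_bias, std=0.02)
        nn.init.normal_(self.v_bias, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x)
        # Prepend the bias pair as a "virtual" token at position 0.
        k = torch.cat([self.k_bias.expand(b, 1, d), self.k_proj(x)], dim=1)
        v = torch.cat([self.v_bias.expand(b, 1, d), self.v_proj(x)], dim=1)
        scores = q @ k.transpose(-2, -1) / d**0.5  # (b, t, t+1)
        # Causal mask over real tokens; the bias slot (column 0) is
        # always visible, so softmax can dump attention mass onto it.
        mask = torch.ones(t, t + 1, dtype=torch.bool, device=x.device)
        mask = mask.tril(diagonal=1)
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v  # (b, t, d)
```

Because the bias slot participates in every row's softmax, attention weights on real tokens can sum to less than one, which is the "implicit context-aware scaling" role the paper attributes to systematic outliers.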