Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Authors: Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis.
Researcher Affiliation | Academia | 1Rutgers University, 2Carnegie Mellon University, 3New Jersey Institute of Technology, 4University of Minnesota. Correspondence to: Mingyu Jin <EMAIL>, Zirui Liu <EMAIL>, Yongfeng Zhang <EMAIL>.
Pseudocode | No | The paper describes the methods and processes in narrative text and mathematical formulas, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
Open Datasets | Yes | For Contextual Knowledge Understanding Tasks, we adopt mathematical reasoning benchmarks (i.e., GSM-8K (Cobbe et al., 2021), AQUA (Ling et al., 2017)), a sentiment analysis dataset (i.e., IMDB (Maas et al., 2011)), and synthetic passkey retrieval datasets, described in Appendix A, at different difficulty levels. For Parametric Knowledge Retrieval Tasks, we adopt factual knowledge QA such as Cities (Marks & Tegmark, 2023) and our synthetic datasets covering topics in Sports, Arts, Technology, and Celebrity. The dataset details and our data synthesis pipeline can be found in Appendix E and Appendix A.
Dataset Splits | No | For our experiments, we selected 1,000 samples from the dataset and instructed the LLM to classify each review as either positive or negative sentiment based on a provided system prompt (see Figure 38). In our experiments, we used the first 1,000 samples from the training set of GSM8K. For the synthetic datasets, 'After the human evaluation, we sample 200 examples for each category to construct the final synthetic dataset.' While specific subsets are used, explicit train/test/validation splits for these samples within the authors' experimental methodology are not provided.
Hardware Specification | No | During inference, we consistently used flash_attention_2 (Dao et al., 2022) for faster inference speeds. However, the paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | For the main table, Table 1 in the body text, we used three classic models with RoPE: Llama-3-8B-Instruct (Dubey et al., 2024), google/gemma-2-9b-it (Team et al., 2024), and Qwen2.5-7B-Instruct (Yang et al., 2024). During inference, we consistently used flash_attention_2 (Dao et al., 2022) for faster inference speeds. No specific version numbers for software libraries (e.g., Python, PyTorch) are provided.
Experiment Setup | Yes | Definition 1 (Massive Value). A massive value is an element M_{h,d} that satisfies M_{h,d} ≥ λ · (1/D) Σ_{d'=1}^{D} M_{h,d'} (Eq. 8), where λ > 1 is a threshold controlling massive value selection. In our experiments, we empirically set λ = 5. Our investigation reveals that disrupting massive values can be accomplished through several substitution methods: using mean values, zeros, maxima, or minima. We disrupt massive/non-massive values on both Q and K as in Table 1. As shown in Figure 3c, we systematically varied n from 1 to 20 (represented on the horizontal axis) in our control experiments.
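The threshold rule in Definition 1 and the substitution-based disruption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the matrix `M` (shape heads × head_dim), the helper names, and the exact quantity being thresholded (raw entries vs. norms of Q/K) are assumptions for illustration; only the λ-times-the-mean rule and the mean/zero/max/min substitution modes come from the quoted setup.

```python
import numpy as np

def massive_value_mask(M, lambda_=5.0):
    """Flag entries M[h, d] that are at least lambda_ times the mean
    over their row, per the paper's Definition 1 (lambda_ = 5)."""
    row_mean = M.mean(axis=-1, keepdims=True)   # (1/D) * sum_d' M[h, d']
    return M >= lambda_ * row_mean              # boolean mask of massive values

def disrupt(M, mask, mode="mean"):
    """Replace masked entries with mean / zero / max / min values,
    mirroring the substitution methods listed in the setup."""
    out = M.copy()
    if mode == "mean":
        fill = np.broadcast_to(M.mean(axis=-1, keepdims=True), M.shape)
    elif mode == "zero":
        fill = np.zeros_like(M)
    elif mode == "max":
        fill = np.broadcast_to(M.max(axis=-1, keepdims=True), M.shape)
    elif mode == "min":
        fill = np.broadcast_to(M.min(axis=-1, keepdims=True), M.shape)
    else:
        raise ValueError(f"unknown mode: {mode}")
    out[mask] = fill[mask]
    return out

# Toy example: one head, head_dim D = 8; the last entry dominates.
# Row mean = 27/8 = 3.375, so the threshold is 5 * 3.375 = 16.875.
M = np.array([[1.0] * 7 + [20.0]])
mask = massive_value_mask(M)       # only M[0, 7] = 20.0 exceeds 16.875
M_disrupted = disrupt(M, mask, mode="mean")
```

Note that passing `~mask` instead of `mask` disrupts the non-massive values, matching the paper's massive vs. non-massive control comparison; the same operation would be applied to both Q and K.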