Retrieval Head Mechanistically Explains Long-Context Factuality
Authors: Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information (either copy-paste or paraphrase), which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when extending the context length by continual pretraining, it is still the same set of heads that performs information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the rest are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer from its intrinsic knowledge are less affected by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. |
| Researcher Affiliation | Academia | Wenhao Wu (Peking University), Yizhong Wang (University of Washington), Guangxuan Xiao (MIT), Hao Peng (UIUC), Yao Fu (University of Edinburgh) |
| Pseudocode | No | The paper describes the 'Retrieval Head Detection Algorithm' in Section 2, detailing the steps and criteria for calculating the retrieval score, and provides a formula (Equation 1). However, this is presented in prose within the text and not as a formally structured pseudocode or algorithm block (e.g., in a dedicated figure or box). |
| Open Source Code | Yes | https://github.com/nightdessert/Retrieval_Head |
| Open Datasets | Yes | Needle-in-a-Haystack (NIAH): Our retrieval head detection algorithm is rooted in the Needle-in-a-Haystack test (NIAH)... We construct three sets of NIAH samples. MMLU (Hendrycks et al., 2020), MuSiQue, and GSM8K (Cobbe et al., 2021), with and without chain-of-thought (CoT) reasoning. |
| Dataset Splits | No | For each (q, k, x) sample, we conduct the NIAH test by evaluating the model's behavior over 20 different sequence lengths uniformly sampled between 1K and 50K tokens. At each length, q is inserted at 10 evenly distributed positions, from the start to the end of x. This allows us to evaluate the model's retrieval capabilities at varying depths and in diverse contexts. Our experiments show that the retrieval score stabilizes quickly, often converging after just a few samples. In total, each model undergoes approximately 600 retrieval testing instances. |
| Hardware Specification | No | A major challenge in deploying long-context models is the significant memory overhead caused by the large KV cache. For example, Llama-2 7B requires more than 50GB of memory to maintain a 100K-token KV cache, compared to less than 1GB for a 2K context. This discrepancy drastically reduces the concurrency of 100K-token queries, making deployment on systems like an 80GB A100 GPU prohibitively expensive. |
| Software Dependencies | No | The paper does not mention any specific software names with version numbers (e.g., Python, PyTorch, CUDA versions, or other libraries). |
| Experiment Setup | Yes | Our retrieval head detection algorithm is rooted in the Needle-in-a-Haystack test (NIAH), which asks the model to copy-paste the input tokens to the output. Given a question q and its corresponding answer k (the "needle"), we insert k into a context x (the "haystack") at a randomly chosen position indexed by iq. ... The language model is tasked with answering q based on the haystack with the inserted needle. ... We define the retrieval score as the frequency of a head's copy-paste operations. Specifically, during auto-regressive decoding (we use greedy decoding by default), denote the current token being generated as w and the attention scores of a head as a ∈ ℝ^\|x\|. ... For each (q, k, x) sample, we conduct the NIAH test by evaluating the model's behavior over 20 different sequence lengths uniformly sampled between 1K and 50K tokens. At each length, q is inserted at 10 evenly distributed positions, from the start to the end of x. ... To identify retrieval heads, we apply a threshold criterion. In our experiments (Fig. 3), a head is classified as a retrieval head if its average retrieval score exceeds 0.1, meaning that it successfully performs a copy-paste operation in at least 10% of the test cases. |
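The retrieval-score criterion quoted above can be made concrete in a few lines. The following is a minimal sketch, not the authors' released code (which is at the GitHub link above); the function name, argument names, and toy data are illustrative. It counts a decoding step as a copy-paste for a given head when the head's most-attended input position (the argmax of its attention row) falls inside the needle span and the token at that position equals the token being generated, then returns the fraction of such steps:

```python
def retrieval_score(attn_rows, generated_tokens, input_tokens, needle_span):
    """Fraction of decoding steps that are copy-paste operations for one head.

    attn_rows        -- one attention row (list of floats over input positions)
                        per decoding step, for the head being scored
    generated_tokens -- token id produced at each decoding step
    input_tokens     -- token ids of the full input (haystack with needle)
    needle_span      -- (start, end) indices of the needle, end exclusive
    """
    start, end = needle_span
    hits = 0
    for attn, w in zip(attn_rows, generated_tokens):
        # argmax: the input position this head attends to most strongly
        j = max(range(len(attn)), key=attn.__getitem__)
        # copy-paste: attending inside the needle AND copying that exact token
        if start <= j < end and input_tokens[j] == w:
            hits += 1
    return hits / max(len(generated_tokens), 1)


# Toy example: the needle occupies positions 2-3 of the input.
input_tokens = [10, 11, 42, 43, 12]
generated = [42, 99]
attn_rows = [
    [0.0, 0.0, 0.9, 0.05, 0.05],  # step 0: attends to position 2, copies 42
    [0.8, 0.1, 0.05, 0.03, 0.02],  # step 1: attends outside the needle
]
score = retrieval_score(attn_rows, generated, input_tokens, (2, 4))
```

Averaging this score over the ~600 NIAH test instances per model and applying the 0.1 threshold yields the set of retrieval heads the paper describes.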