Retrieval Head Mechanistically Explains Long-Context Factuality
Authors: Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information (either copy-paste or paraphrase), which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when extending the context length by continual pretraining, it is still the same set of heads that performs information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the rest are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer from its intrinsic knowledge are less affected by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. |
| Researcher Affiliation | Academia | Wenhao Wu (Peking University), Yizhong Wang (University of Washington), Guangxuan Xiao (MIT), Hao Peng (UIUC), Yao Fu (University of Edinburgh) |
| Pseudocode | No | The paper describes the 'Retrieval Head Detection Algorithm' in Section 2, detailing the steps and criteria for calculating the retrieval score, and provides a formula (Equation 1). However, this is presented in prose within the text and not as a formally structured pseudocode or algorithm block (e.g., in a dedicated figure or box). |
| Open Source Code | Yes | https://github.com/nightdessert/Retrieval_Head |
| Open Datasets | Yes | Needle-in-a-Haystack (NIAH): Our retrieval head detection algorithm is rooted in the Needle-in-a-Haystack test (NIAH)... We construct three sets of NIAH samples. MMLU (Hendrycks et al., 2020), MuSiQue, and GSM8K (Cobbe et al., 2021), with and without chain-of-thought (CoT) reasoning. |
| Dataset Splits | No | For each (q, k, x) sample, we conduct the NIAH test by evaluating the model's behavior over 20 different sequence lengths uniformly sampled between 1K and 50K tokens. At each length, q is inserted at 10 evenly distributed positions, from the start to the end of x. This allows us to evaluate the model's retrieval capabilities at varying depths and in diverse contexts. Our experiments show that the retrieval score stabilizes quickly, often converging after just a few samples. In total, each model undergoes approximately 600 retrieval testing instances. |
| Hardware Specification | No | A major challenge in deploying long-context models is the significant memory overhead caused by the large KV cache. For example, Llama-2 7B requires more than 50GB of memory to maintain a 100K-token KV cache, compared to less than 1GB for a 2K context. This discrepancy drastically reduces the concurrency of 100K-token queries, making deployment on systems like an 80GB A100 GPU prohibitively expensive. |
| Software Dependencies | No | The paper does not mention any specific software names with version numbers (e.g., Python, PyTorch, CUDA versions, or other libraries). |
| Experiment Setup | Yes | Our retrieval head detection algorithm is rooted in the Needle-in-a-Haystack test (NIAH), which asks the model to copy-paste the input tokens to the output. Given a question q and its corresponding answer k (the "needle"), we insert k into a context x (the "haystack") at a randomly chosen position indexed by iq. ... The language model is tasked with answering q based on the haystack with the inserted needle. ... We define the retrieval score as the frequency of a head's copy-paste operations. Specifically, during auto-regressive decoding (we use greedy decoding by default), denote the current token being generated as w and the attention scores of a head as a ∈ ℝ^\|x\|. ... For each (q, k, x) sample, we conduct the NIAH test by evaluating the model's behavior over 20 different sequence lengths uniformly sampled between 1K and 50K tokens. At each length, q is inserted at 10 evenly distributed positions, from the start to the end of x. ... To identify retrieval heads, we apply a threshold criterion. In our experiments (Fig. 3), a head is classified as a retrieval head if its average retrieval score exceeds 0.1, meaning that it successfully performs a copy-paste operation in at least 10% of the test cases. |
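The retrieval-score criterion quoted above can be made concrete in a few lines. The following is a minimal sketch, not the authors' released code (which is at the GitHub link above); the function name, argument names, and toy data are illustrative. It counts a decoding step as a copy-paste for a given head when the head's most-attended input position (the argmax of its attention row) falls inside the needle span and the token at that position equals the token being generated, then returns the fraction of such steps:

```python
def retrieval_score(attn_rows, generated_tokens, input_tokens, needle_span):
    """Fraction of decoding steps that are copy-paste operations for one head.

    attn_rows        -- one attention row (list of floats over input positions)
                        per decoding step, for the head being scored
    generated_tokens -- token id produced at each decoding step
    input_tokens     -- token ids of the full input (haystack with needle)
    needle_span      -- (start, end) indices of the needle, end exclusive
    """
    start, end = needle_span
    hits = 0
    for attn, w in zip(attn_rows, generated_tokens):
        # argmax: the input position this head attends to most strongly
        j = max(range(len(attn)), key=attn.__getitem__)
        # copy-paste: attending inside the needle AND copying that exact token
        if start <= j < end and input_tokens[j] == w:
            hits += 1
    return hits / max(len(generated_tokens), 1)


# Toy example: the needle occupies positions 2-3 of the input.
input_tokens = [10, 11, 42, 43, 12]
generated = [42, 99]
attn_rows = [
    [0.0, 0.0, 0.9, 0.05, 0.05],  # step 0: attends to position 2, copies 42
    [0.8, 0.1, 0.05, 0.03, 0.02],  # step 1: attends outside the needle
]
score = retrieval_score(attn_rows, generated, input_tokens, (2, 4))
```

Averaging this score over the ~600 NIAH test instances per model and applying the 0.1 threshold yields the set of retrieval heads the paper describes.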