Which Attention Heads Matter for In-Context Learning?
Authors: Kayo Yin, Jacob Steinhardt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through detailed ablations, we discover that few-shot ICL performance depends primarily on FV heads, especially in larger models. In addition, we uncover that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism that ultimately drives few-shot ICL. |
| Researcher Affiliation | Academia | Kayo Yin Jacob Steinhardt UC Berkeley Correspondence to: Kayo Yin <EMAIL>, Jacob Steinhardt <EMAIL>. |
| Pseudocode | No | The paper describes the methodologies for identifying induction and FV heads, as well as the ablation studies, in paragraph form using mathematical notation where appropriate. It does not include distinct pseudocode or algorithm blocks. |
| Open Source Code | Yes | 1Code and data: https://github.com/kayoyin/icl-heads. |
| Open Datasets | Yes | To identify FV heads, we employ the causal mediation analysis framework from Todd et al. (2024). For each ICL task t in our task set T, where t is defined by a dataset P_t of in-context prompts p_i^t ∈ P_t consisting of input-output pairs (x_i, y_i), we: ... For each attention head, we take the mean FV score across 37 natural language ICL tasks from Todd et al. (2024) (Appendix A.8), using 100 prompts per task. ... We measure token-loss difference by taking the loss of the 50th token in the input prompt minus the loss of the 500th token in the prompt, averaged over 10,000 randomly sampled examples from the Pile dataset (Gao et al., 2021). ... In Table 3, we list the ICL tasks used in this study. We refer to Todd et al. (2024) and Feng & Steinhardt (2024) for a detailed description of each task. |
| Dataset Splits | Yes | To avoid leakage between ICL tasks used to identify FV heads and those used to evaluate FV head ablations, we randomly split the 37 ICL tasks from Todd et al. (2024) into 26 tasks used to measure FV scores of heads, and 11 tasks to evaluate ICL performance. We also add 8 new tasks for ICL evaluation: 4 tasks are variations of tasks in Todd et al. (2024), and 4 are binding tasks from Feng & Steinhardt (2024). In total, we evaluate ICL accuracy on 19 natural language tasks, with 100 prompts per task. Each prompt contains 10 input-output demonstration pairs followed by a single test instance. ... We measure token-loss difference ... averaged over 10,000 randomly sampled examples from the Pile dataset (Gao et al., 2021). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run the experiments. It mentions the models studied (Pythia, GPT-2, Llama 2) and their parameter counts, but not the computational resources. |
| Software Dependencies | No | We use huggingface implementations (Wolf et al., 2020) for all models. We measure their induction scores using the Transformer Lens framework (Nanda & Bloom, 2022). While these frameworks are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | Ablation. To assess the causal contribution of different attention heads, we measure how ICL performance changes when specific heads are disabled. We use mean ablation, where we replace each target head's output with its average output across our task dataset (described in later sections). ... To control for the correlation between induction and FV heads identified in Section 3, we introduce ablation with exclusion: when ablating n FV heads, we select the top n heads by FV score that are not in the top 2% by induction score, and vice versa. ... We evaluate the impact of ablating different proportions (1-20%) of the top attention heads based on induction or FV score, across all models. ... Each ICL task is defined by a set of input-output pairs (x_i, y_i). The model is prompted with 10 input-output exemplar pairs that demonstrate this task, and one query input x_q that corresponds to a target output y_q that is not part of the model's prompt. |
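The token-loss difference metric quoted above (loss at the 50th token minus loss at the 500th token, averaged over examples) can be sketched as a small helper. This is a minimal illustration of the arithmetic only, not the paper's code; the per-token loss array and its shape are assumptions.

```python
import numpy as np

def token_loss_difference(per_token_losses, early=50, late=500):
    """Mean of (loss at the `early`-th token - loss at the `late`-th token)
    across examples. A larger positive value means the model predicts better
    with more preceding context, which the paper uses as an aggregate
    measure of in-context learning ability.

    per_token_losses: array of shape [n_examples, seq_len] (assumed layout),
    where entry [i, j] is the loss of the (j+1)-th token of example i.
    """
    losses = np.asarray(per_token_losses, dtype=float)
    # Convert 1-indexed token positions to 0-indexed array columns.
    return float(np.mean(losses[:, early - 1] - losses[:, late - 1]))
```

In the paper this average is taken over 10,000 samples from the Pile; here any [n_examples, 500+] loss array works.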
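The leakage-avoiding split described in the Dataset Splits row (37 tasks randomly divided into 26 for measuring FV scores and 11 for evaluating ablations) amounts to a seeded random partition. A minimal sketch, with the function name and seed as assumptions:

```python
import random

def split_icl_tasks(tasks, n_score=26, seed=0):
    """Randomly partition ICL tasks into a set used to measure FV scores
    and a disjoint held-out set used to evaluate head ablations, so the
    same tasks are never used for both (avoiding leakage)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled[:n_score], shuffled[n_score:]
```

With the paper's 37 tasks this yields 26 scoring tasks and 11 evaluation tasks; the 8 extra evaluation tasks mentioned in the table would be appended to the held-out set afterwards.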
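Mean ablation, as described in the Experiment Setup row, replaces a target head's output with its average output over the task dataset. The sketch below shows the core tensor operation on a precomputed activation array; the [n_examples, n_heads, d_head] layout is an assumption for illustration (in practice this would be done with a forward hook, e.g. via TransformerLens, which the paper uses for induction scores):

```python
import numpy as np

def mean_ablate_head(head_outputs, head_idx):
    """Mean-ablate one attention head: overwrite its per-example outputs
    with the head's mean output across the dataset.

    head_outputs: array [n_examples, n_heads, d_head] of cached per-head
    outputs (assumed layout). Returns a copy; other heads are untouched.
    """
    ablated = head_outputs.copy()
    # Average this head's output over all examples, then broadcast it back.
    ablated[:, head_idx, :] = head_outputs[:, head_idx, :].mean(axis=0)
    return ablated
```

Unlike zero ablation, this keeps the head's contribution at a "typical" value, isolating the loss of example-specific information rather than the loss of the head's average bias.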