Understanding Emergent In-Context Learning from a Kernel Regression Perspective
Authors: Chi Han, Ziqi Wang, Han Zhao, Heng Ji
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Following our theoretical investigation, in Section 5 we conduct empirical studies to verify our explanation of in-context learning of LLMs in more detail. Our results reveal that during LLM ICL, the attention map used by the last token to predict the next token is allocated in accordance with our explanation. By plugging attention values into our equation, we are also able to reconstruct the model's output with over 80% accuracy. Moreover, we are able to reveal how information necessary to kernel regression is computed in intermediate LLM layers. |
| Researcher Affiliation | Academia | Chi Han EMAIL Ziqi Wang EMAIL Han Zhao EMAIL Heng Ji EMAIL Siebel School of Computing and Data Science University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes methods and theoretical analysis using mathematical equations and textual descriptions, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and resources are publicly available at https://github.com/Glaciohound/Explain-ICL-As-Kernel-Regression. |
| Open Datasets | Yes | In this section, we use the validation set of the sst2 dataset as a case study, while results on more tasks can be found in Appendix B. Specifically, we experiment on Rotten Tomatoes (https://huggingface.co/datasets/rotten_tomatoes/), Tweet Eval's hate, irony, and offensive subtasks (https://huggingface.co/datasets/tweet_eval/), and MNLI (https://huggingface.co/datasets/glue/viewer/mnli_matched/test). The results are as follows. |
| Dataset Splits | Yes | We uniformly use 700 data in each validation set to represent tasks in a balanced way. We select the attention head that has the highest correlation with model predictions. This head is then evaluated on ICL prediction reconstruction and task performance on a held-out set of 300 data per task. |
| Hardware Specification | Yes | Limited by computation resources in an academic lab, we analyze the GPT-J 6B model (Wang & Komatsuzaki, 2021) on a single Tesla V100. |
| Software Dependencies | No | The paper mentions various models (e.g., GPT-J 6B, Llama-2, Llama-3) and datasets used, but it does not specify software dependencies like programming language versions or library versions required for replication. |
| Experiment Setup | Yes | For each head, we conduct Ridge regression with λ = 0.01 to fit the task in these two settings. |
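
The Ridge regression step in the experiment setup can be sketched in a few lines. The sketch below is illustrative only: the data shapes, feature construction, and the use of raw per-head attention values as regression inputs are assumptions, not the paper's actual pipeline; only the regularization strength λ = 0.01 and the closed-form ridge solution come from the description above.

```python
import numpy as np

# Hypothetical setup: fit task labels from per-head attention features
# with ridge regularization lambda = 0.01, as in the reported experiment setup.
rng = np.random.default_rng(0)

n_examples, n_features = 700, 16          # assumed sizes, not from the paper
X = rng.random((n_examples, n_features))  # e.g. attention values for one head
y = rng.integers(0, 2, n_examples).astype(float)  # binary labels (e.g. sst2)

lam = 0.01
# Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Threshold the fitted scores to get label predictions
preds = (X @ w > 0.5).astype(float)
accuracy = (preds == y).mean()
```

Per the paper's protocol, such a fit would be done per attention head, with the best-correlated head then evaluated on a held-out set of 300 examples per task.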