Understanding Emergent In-Context Learning from a Kernel Regression Perspective

Authors: Chi Han, Ziqi Wang, Han Zhao, Heng Ji

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Following our theoretical investigation, in Section 5 we conduct empirical studies to verify our explanation of in-context learning of LLMs in more detail. Our results reveal that during LLM ICL, the attention map used by the last token to predict the next token is allocated in accordance with our explanation. By plugging attention values into our equation, we are also able to reconstruct the model's output with over 80% accuracy. Moreover, we are able to reveal how the information necessary for kernel regression is computed in intermediate LLM layers."
Researcher Affiliation | Academia | "Chi Han (EMAIL), Ziqi Wang (EMAIL), Han Zhao (EMAIL), Heng Ji (EMAIL), Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign"
Pseudocode | No | The paper describes its methods and theoretical analysis using mathematical equations and textual descriptions, but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and resources are publicly available at https://github.com/Glaciohound/Explain-ICL-As-Kernel-Regression."
Open Datasets | Yes | "In this section, we use the validation set of the sst2 dataset as a case study, while results on more tasks can be found in Appendix B. Specifically, we experiment on Rotten Tomatoes [1], Tweet Eval's [2] hate, irony and offensive subtasks, and MNLI [3]. The results are as follows." [1] https://huggingface.co/datasets/rotten_tomatoes/ [2] https://huggingface.co/datasets/tweet_eval/ [3] https://huggingface.co/datasets/glue/viewer/mnli_matched/test
Dataset Splits | Yes | "We uniformly use 700 data points from each validation set to represent tasks in a balanced way. We select the attention head that has the highest correlation with model predictions. This head is then evaluated on ICL prediction reconstruction and task performance on a held-out set of 300 data points per task."
Hardware Specification | Yes | "Limited by the computation resources of an academic lab, we analyze the GPT-J 6B model (Wang & Komatsuzaki, 2021) on one Tesla V100."
Software Dependencies | No | The paper mentions various models (e.g., GPT-J 6B, Llama-2, Llama-3) and the datasets used, but it does not specify software dependencies such as programming-language or library versions required for replication.
Experiment Setup | Yes | "For each head, we conduct Ridge regression with λ = 0.01 to fit the task in these 2 questions."
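For context on the setup quoted above, the two pieces of machinery it names can be sketched in a few lines: attention over in-context demonstrations acting as kernel weights for a prediction, and a closed-form Ridge regression with λ = 0.01. This is a toy illustration on synthetic data, not the paper's implementation; the array shapes, the softmax kernel, and the closed-form solve are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "in-context demonstrations": hidden states x_i with labels y_i,
# plus a query hidden state q for the test input. (Shapes are invented.)
X = rng.normal(size=(8, 4))                  # 8 demonstrations, dim 4
y = rng.integers(0, 2, size=8).astype(float) # binary labels
q = rng.normal(size=4)                       # query representation

# Kernel-regression view: the last token's attention over demonstrations
# acts as a normalized kernel, so y_hat = sum_i softmax(q . x_i) * y_i.
scores = X @ q
attn = np.exp(scores - scores.max())         # numerically stable softmax
attn /= attn.sum()
y_hat = attn @ y                             # convex combination of labels

# Ridge regression with lam = 0.01 (the value quoted in the setup), in
# closed form: w = (X^T X + lam * I)^{-1} X^T y.
lam = 0.01
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
ridge_pred = q @ w
```

Because the attention weights sum to 1 and the labels are in {0, 1}, the kernel prediction `y_hat` always lies in [0, 1], which is what makes reading it as a (soft) label reconstruction sensible.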