Understanding Emergent In-Context Learning from a Kernel Regression Perspective

Authors: Chi Han, Ziqi Wang, Han Zhao, Heng Ji

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Following our theoretical investigation, in Section 5 we conduct empirical studies to verify our explanation of in-context learning of LLMs in more detail. Our results reveal that during LLM ICL, the attention map used by the last token to predict the next token is allocated in accordance with our explanation. By plugging attention values into our equation, we are also able to reconstruct the model's output with over 80% accuracy. Moreover, we are able to reveal how the information necessary for kernel regression is computed in intermediate LLM layers."
Researcher Affiliation | Academia | "Chi Han (EMAIL), Ziqi Wang (EMAIL), Han Zhao (EMAIL), Heng Ji (EMAIL), Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign"
Pseudocode | No | The paper describes its methods and theoretical analysis using mathematical equations and textual descriptions, but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and resources are publicly available at https://github.com/Glaciohound/Explain-ICL-As-Kernel-Regression."
Open Datasets | Yes | "In this section, we use the validation set of the sst2 dataset as a case study, while results on more tasks can be found in Appendix B. Specifically, we experiment on Rotten Tomatoes [1], Tweet Eval's [2] hate, irony and offensive subtasks, and MNLI [3]. The results are as follows." [1] https://huggingface.co/datasets/rotten_tomatoes/ [2] https://huggingface.co/datasets/tweet_eval/ [3] https://huggingface.co/datasets/glue/viewer/mnli_matched/test
Dataset Splits | Yes | "We uniformly use 700 data points from each validation set to represent tasks in a balanced way. We select the attention head that has the highest correlation with model predictions. This head is then evaluated on ICL prediction reconstruction and task performance on a held-out set of 300 data points per task."
Hardware Specification | Yes | "Limited by the computation resources of an academic lab, we analyze the GPT-J 6B model (Wang & Komatsuzaki, 2021) on one Tesla V100."
Software Dependencies | No | The paper mentions various models (e.g., GPT-J 6B, Llama-2, Llama-3) and the datasets used, but it does not specify software dependencies such as programming-language or library versions required for replication.
Experiment Setup | Yes | "For each head, we conduct Ridge regression with λ = 0.01 to fit the task in these 2 questions."
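For context on the setup quoted above, the two pieces of machinery it names can be sketched in a few lines: attention over in-context demonstrations acting as kernel weights for a prediction, and a closed-form Ridge regression with λ = 0.01. This is a toy illustration on synthetic data, not the paper's implementation; the array shapes, the softmax kernel, and the closed-form solve are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "in-context demonstrations": hidden states x_i with labels y_i,
# plus a query hidden state q for the test input. (Shapes are invented.)
X = rng.normal(size=(8, 4))                  # 8 demonstrations, dim 4
y = rng.integers(0, 2, size=8).astype(float) # binary labels
q = rng.normal(size=4)                       # query representation

# Kernel-regression view: the last token's attention over demonstrations
# acts as a normalized kernel, so y_hat = sum_i softmax(q . x_i) * y_i.
scores = X @ q
attn = np.exp(scores - scores.max())         # numerically stable softmax
attn /= attn.sum()
y_hat = attn @ y                             # convex combination of labels

# Ridge regression with lam = 0.01 (the value quoted in the setup), in
# closed form: w = (X^T X + lam * I)^{-1} X^T y.
lam = 0.01
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
ridge_pred = q @ w
```

Because the attention weights sum to 1 and the labels are in {0, 1}, the kernel prediction `y_hat` always lies in [0, 1], which is what makes reading it as a (soft) label reconstruction sensible.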