Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning
Authors: Xiaolei Wang, Xinyu Tang, Junyi Li, Xin Zhao, Ji-Rong Wen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our approach, we conduct experiments on extensive datasets and LLMs with various training settings. Experimental results demonstrate that this simple method can effectively enhance ICL performance, outperforming several competitive baselines, even with less than half the parameters of a larger LLM. |
| Researcher Affiliation | Academia | 1Gaoling School of Artificial Intelligence, Renmin University of China 2School of Information, Renmin University of China 3Beijing Key Laboratory of Big Data Management and Analysis Methods 4Department of Computer Science, National University of Singapore |
| Pseudocode | No | The paper describes methods using mathematical formulas and text, but no explicit pseudocode blocks or algorithms are provided. |
| Open Source Code | Yes | The code is available at https://github.com/RUCAIBox/Competitive-ICL. |
| Open Datasets | Yes | Tasks and Datasets. Following Pan et al. (2023), we select 16 datasets across four types of tasks for the experiment: sentiment analysis, topic/state classification, toxicity detection, and natural language inference/paraphrase detection. Details about the datasets are depicted in Appendix A. ... SST-2 (Socher et al., 2013), financial phrasebank (Malo et al., 2014), emotion (Saravia et al., 2018), and poem sentiment (Sheng & Uthus, 2020). |
| Dataset Splits | Yes | Due to computational constraints, we sample 1000 examples from each dataset for evaluation. ... Additionally, we randomly sample 300 examples as the development set for validation in Section 4 and another 1000 examples as the test set for evaluation in all experiments from the development set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not specify version numbers for any key software components or libraries used in the implementation. |
| Experiment Setup | Yes | To make the output as deterministic as possible, we set temperature=0 when sampling. We randomly sample 16 examples as demonstrations by default across the paper following Min et al. (2022). The discussion about the number of examples can be seen in Appendix B.2. We use minimal templates to construct demonstrations following Pan et al. (2023). Specifically, we use a single newline character (i.e., \n) to connect each input-label pair and three ones to separate examples. We utilize symbols as labels in the abstract setting. Other kinds of abstract labels yield similar results as discussed in Appendix B.3. The results are averaged across five random seeds. ... ϵ is set to 0.01 in the experiment. |
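The minimal demonstration template quoted above (input-label pairs joined by a single newline, examples separated by three newlines) can be sketched as follows. This is a hypothetical illustration, not the authors' released code; the helper name `build_prompt` and the example labels are assumptions.

```python
# Sketch of the minimal ICL prompt template described in the setup:
# "\n" joins each input-label pair, "\n\n\n" separates examples
# (per Pan et al., 2023). build_prompt is a hypothetical helper.

def build_prompt(examples, query):
    """examples: list of (input_text, label) pairs; query: test input."""
    demos = "\n\n\n".join(f"{x}\n{y}" for x, y in examples)
    return f"{demos}\n\n\n{query}\n"

demos = [("the movie was great", "positive"),
         ("a dull, lifeless plot", "negative")]
prompt = build_prompt(demos, "an unexpected delight")
```

In the paper's abstract-label setting, the natural-language labels above would be replaced with symbols (e.g., "@" / "#"), leaving the template structure unchanged.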