Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning
Authors: Xiaolei Wang, Xinyu Tang, Junyi Li, Xin Zhao, Ji-Rong Wen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our approach, we conduct experiments on extensive datasets and LLMs with various training settings. Experimental results demonstrate that this simple method can effectively enhance ICL performance, outperforming several competitive baselines, even with less than half the parameters of a larger LLM. |
| Researcher Affiliation | Academia | 1Gaoling School of Artificial Intelligence, Renmin University of China 2School of Information, Renmin University of China 3Beijing Key Laboratory of Big Data Management and Analysis Methods 4Department of Computer Science, National University of Singapore |
| Pseudocode | No | The paper describes methods using mathematical formulas and text, but no explicit pseudocode blocks or algorithms are provided. |
| Open Source Code | Yes | The code is available at https://github.com/RUCAIBox/Competitive-ICL. |
| Open Datasets | Yes | Tasks and Datasets. Following Pan et al. (2023), we select 16 datasets across four types of tasks for the experiment: sentiment analysis, topic/state classification, toxicity detection, and natural language inference/paraphrase detection. Details about the datasets are depicted in Appendix A. ... SST-2 (Socher et al., 2013), financial phrasebank (Malo et al., 2014), emotion (Saravia et al., 2018), and poem sentiment (Sheng & Uthus, 2020). |
| Dataset Splits | Yes | Due to computational constraints, we sample 1000 examples from each dataset for evaluation. ... Additionally, we randomly sample 300 examples as the development set for validation in Section 4 and another 1000 examples as the test set for evaluation in all experiments from the development set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not specify version numbers for any key software components or libraries used in the implementation. |
| Experiment Setup | Yes | To make the output as deterministic as possible, we set temperature=0 when sampling. We randomly sample 16 examples as demonstrations by default across the paper following Min et al. (2022). The discussion about the number of examples can be seen in Appendix B.2. We use minimal templates to construct demonstrations following Pan et al. (2023). Specifically, we use a single newline character (i.e., \n) to connect each input-label pair and three ones to separate examples. We utilize symbols as labels in the abstract setting. Other kinds of abstract labels yield similar results as discussed in Appendix B.3. The results are averaged across five random seeds. ... ϵ is set to 0.01 in the experiment. |
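The minimal demonstration template quoted above (input-label pairs joined by a single newline, examples separated by three newlines) can be sketched as follows. This is a hypothetical illustration, not the authors' released code; the helper name `build_prompt` and the example labels are assumptions.

```python
# Sketch of the minimal ICL prompt template described in the setup:
# "\n" joins each input-label pair, "\n\n\n" separates examples
# (per Pan et al., 2023). build_prompt is a hypothetical helper.

def build_prompt(examples, query):
    """examples: list of (input_text, label) pairs; query: test input."""
    demos = "\n\n\n".join(f"{x}\n{y}" for x, y in examples)
    return f"{demos}\n\n\n{query}\n"

demos = [("the movie was great", "positive"),
         ("a dull, lifeless plot", "negative")]
prompt = build_prompt(demos, "an unexpected delight")
```

In the paper's abstract-label setting, the natural-language labels above would be replaced with symbols (e.g., "@" / "#"), leaving the template structure unchanged.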