Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Authors: Qixun Wang, Yifei Wang, Xianghua Ying, Yisen Wang

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this work, we investigate the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To this end, we conduct synthetic experiments using a GPT-2 model to learn OOD mathematical functions through ICL. Our findings reveal that Transformers may struggle to learn OOD tasks via ICL." |
| Researcher Affiliation | Academia | 1. State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2. MIT CSAIL; 3. Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes algorithms in prose and mentions an "algorithm selection mechanism" but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/NOVAglow646/ICL-OOD. |
| Open Datasets | No | "To this end, we conduct synthetic experiments using a GPT-2 model to learn OOD mathematical functions through ICL. Our findings reveal that Transformers may struggle to learn OOD tasks via ICL. Specifically, ICL operates by selecting a function within the pretraining hypothesis space and optimizing it via gradient descent using in-context examples, rather than learning truly novel functions." |
| Dataset Splits | No | The paper describes generating synthetic data and different function classes for evaluation, not specific training/validation/test splits of a fixed dataset. For example, it states "x_i ∈ R^d are sampled from a standard Gaussian distribution N(0, 1) with dimension d = 20". |
| Hardware Specification | No | The paper mentions using GPT-2 and Llama models but does not provide specific hardware details such as the GPU/CPU models or memory used for the experiments. |
| Software Dependencies | No | The paper does not list ancillary software with version numbers. It names models such as GPT-2 and Llama, but not the software environment or libraries used for implementation. |
| Experiment Setup | Yes | "The models are optimized using SGD with learning rate 1e-3 for 1000 steps. All models are trained with 200,000 × 64 sequences, where 200,000 is the number of training steps and 64 is the batch size." |
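The synthetic data described in the Dataset Splits row can be sketched as follows. The Gaussian inputs (x_i ~ N(0, 1), d = 20) and the batch size of 64 come from the paper's quoted setup; the choice of a linear function class and the sequence length of 40 in-context points are illustrative assumptions, not details confirmed by the report.

```python
import numpy as np

def sample_icl_sequence(n_points, d=20, rng=None):
    """Sample one in-context sequence of (x, y) pairs for a synthetic task.

    Inputs x_i are drawn from a standard Gaussian N(0, 1) with d = 20,
    as stated in the paper. The linear task w is a hypothetical stand-in
    for one of the paper's function classes.
    """
    rng = rng or np.random.default_rng()
    w = rng.standard_normal(d)               # hypothetical task vector
    xs = rng.standard_normal((n_points, d))  # x_i ~ N(0, 1), dimension d
    ys = xs @ w                              # y_i = <w, x_i>
    return xs, ys

# One training batch of 64 sequences, matching the reported batch size.
batch = [sample_icl_sequence(n_points=40) for _ in range(64)]
```

Each sequence would then be serialized as alternating (x, y) tokens and fed to the Transformer, with a fresh task vector per sequence.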
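The Open Datasets row quotes the paper's central claim: ICL selects a function from the pretraining hypothesis space and optimizes it via gradient descent on the in-context examples. A minimal sketch of that mechanism, assuming a linear hypothesis class and a mean-squared-error objective (both our assumptions), using the reported optimizer settings (learning rate 1e-3, 1000 steps):

```python
import numpy as np

def gradient_descent_on_context(xs, ys, lr=1e-3, steps=1000):
    """Fit a linear function to in-context examples by gradient descent.

    lr=1e-3 and steps=1000 match the optimizer settings quoted in the
    Experiment Setup row; applying them to this toy least-squares
    objective is our illustrative assumption.
    """
    n, d = xs.shape
    w = np.zeros(d)  # start from a fixed point in the hypothesis space
    for _ in range(steps):
        grad = xs.T @ (xs @ w - ys) / n  # gradient of mean squared error
        w -= lr * grad
    return w

# Toy context: 64 examples of a d=20 linear task.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(20)
xs = rng.standard_normal((64, 20))
ys = xs @ w_true
w_hat = gradient_descent_on_context(xs, ys)
```

Under this view, the model can only reach functions expressible in its pretraining hypothesis space, which is why the paper argues ICL struggles on truly novel OOD functions.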