Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Authors: Qixun Wang, Yifei Wang, Xianghua Ying, Yisen Wang

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this work, we investigate the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To this end, we conduct synthetic experiments using a GPT-2 model to learn OOD mathematical functions through ICL. Our findings reveal that Transformers may struggle to learn OOD tasks via ICL." |
| Researcher Affiliation | Academia | 1. State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2. MIT CSAIL; 3. Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes algorithms in prose and mentions an "algorithm selection mechanism" but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/NOVAglow646/ICL-OOD. |
| Open Datasets | No | "To this end, we conduct synthetic experiments using a GPT-2 model to learn OOD mathematical functions through ICL. Our findings reveal that Transformers may struggle to learn OOD tasks via ICL. Specifically, ICL operates by selecting a function within the pretraining hypothesis space and optimizing it via gradient descent using in-context examples, rather than learning truly novel functions." |
| Dataset Splits | No | The paper describes generating synthetic data and different function classes for evaluation, not specific training/validation/test splits of a fixed dataset. For example, it states "x_i ∈ R^d are sampled from a standard Gaussian distribution N(0, 1) with dimension d = 20". |
| Hardware Specification | No | The paper mentions using GPT-2 and Llama models but does not provide specific hardware details such as the GPU/CPU models or memory used for the experiments. |
| Software Dependencies | No | The paper does not list ancillary software with version numbers. It names models such as GPT-2 and Llama, but not the software environment or libraries used for implementation. |
| Experiment Setup | Yes | "The models are optimized using SGD with learning rate 1e-3 for 1000 steps. All models are trained with 200,000 × 64 sequences, where 200,000 is the number of training steps and 64 is the batch size." |
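The synthetic data described in the Dataset Splits row can be sketched as follows. The Gaussian inputs (x_i ~ N(0, 1), d = 20) and the batch size of 64 come from the paper's quoted setup; the choice of a linear function class and the sequence length of 40 in-context points are illustrative assumptions, not details confirmed by the report.

```python
import numpy as np

def sample_icl_sequence(n_points, d=20, rng=None):
    """Sample one in-context sequence of (x, y) pairs for a synthetic task.

    Inputs x_i are drawn from a standard Gaussian N(0, 1) with d = 20,
    as stated in the paper. The linear task w is a hypothetical stand-in
    for one of the paper's function classes.
    """
    rng = rng or np.random.default_rng()
    w = rng.standard_normal(d)               # hypothetical task vector
    xs = rng.standard_normal((n_points, d))  # x_i ~ N(0, 1), dimension d
    ys = xs @ w                              # y_i = <w, x_i>
    return xs, ys

# One training batch of 64 sequences, matching the reported batch size.
batch = [sample_icl_sequence(n_points=40) for _ in range(64)]
```

Each sequence would then be serialized as alternating (x, y) tokens and fed to the Transformer, with a fresh task vector per sequence.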
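The Open Datasets row quotes the paper's central claim: ICL selects a function from the pretraining hypothesis space and optimizes it via gradient descent on the in-context examples. A minimal sketch of that mechanism, assuming a linear hypothesis class and a mean-squared-error objective (both our assumptions), using the reported optimizer settings (learning rate 1e-3, 1000 steps):

```python
import numpy as np

def gradient_descent_on_context(xs, ys, lr=1e-3, steps=1000):
    """Fit a linear function to in-context examples by gradient descent.

    lr=1e-3 and steps=1000 match the optimizer settings quoted in the
    Experiment Setup row; applying them to this toy least-squares
    objective is our illustrative assumption.
    """
    n, d = xs.shape
    w = np.zeros(d)  # start from a fixed point in the hypothesis space
    for _ in range(steps):
        grad = xs.T @ (xs @ w - ys) / n  # gradient of mean squared error
        w -= lr * grad
    return w

# Toy context: 64 examples of a d=20 linear task.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(20)
xs = rng.standard_normal((64, 20))
ys = xs @ w_true
w_hat = gradient_descent_on_context(xs, ys)
```

Under this view, the model can only reach functions expressible in its pretraining hypothesis space, which is why the paper argues ICL struggles on truly novel OOD functions.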