Can In-context Learning Really Generalize to Out-of-distribution Tasks?
Authors: Qixun Wang, Yifei Wang, Xianghua Ying, Yisen Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To this end, we conduct synthetic experiments using a GPT-2 model to learn OOD mathematical functions through ICL. Our findings reveal that Transformers may struggle to learn OOD tasks via ICL. |
| Researcher Affiliation | Academia | 1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2 MIT CSAIL 3 Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes algorithms in prose and mentions 'algorithm selection mechanism' but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/NOVAglow646/ICL-OOD. |
| Open Datasets | No | To this end, we conduct synthetic experiments using a GPT-2 model to learn OOD mathematical functions through ICL. Our findings reveal that Transformers may struggle to learn OOD tasks via ICL. Specifically, ICL operates by selecting a function within the pretraining hypothesis space and optimizing it via gradient descent using in-context examples, rather than learning truly novel functions. |
| Dataset Splits | No | The paper describes generating synthetic data and different function classes for evaluation, not specific training/test/validation splits of a fixed dataset. For example, it states 'xi ∈ Rd are sampled from a standard Gaussian distribution N(0, 1) with dimension d = 20'. |
| Hardware Specification | No | The paper mentions using GPT-2 and Llama models but does not provide specific hardware details like GPU/CPU models or memory used for experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. It mentions models like GPT-2 and Llama, but not the software environment or libraries used for implementation with versions. |
| Experiment Setup | Yes | The models are optimized using SGD with learning rate 1e-3 for 1000 steps. All models are trained with 200,000 × 64 sequences, where 200,000 is the number of training steps and 64 is the batch size. |
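The setup described above (inputs drawn from a standard Gaussian with d = 20, batch size 64) can be sketched as a synthetic data generator. This is a minimal illustration, not the paper's code: the linear function class, the per-sequence number of in-context examples (`n_points`), and the helper name `sample_icl_batch` are assumptions for illustration only.

```python
import numpy as np

def sample_icl_batch(batch_size=64, n_points=40, d=20, rng=None):
    """Sample a batch of synthetic in-context regression sequences.

    Inputs x_i ~ N(0, 1) with d = 20 follow the paper's description;
    the linear task family w^T x below is an assumed example function
    class, and n_points is an assumed sequence length.
    """
    rng = rng or np.random.default_rng(0)
    xs = rng.standard_normal((batch_size, n_points, d))
    w = rng.standard_normal((batch_size, d, 1))  # one task vector per sequence
    ys = (xs @ w).squeeze(-1)                    # labels y_i = w^T x_i
    return xs, ys

xs, ys = sample_icl_batch()
print(xs.shape, ys.shape)  # (64, 40, 20) (64, 40)
```

Each (x, y) pair in a sequence serves as one in-context example; the model is then trained to predict y for the next x given the preceding pairs.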