When can in-context learning generalize out of task distribution?
Authors: Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize out-of-distribution. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition. |
| Researcher Affiliation | Academia | 1Joseph Henry Laboratories of Physics, Princeton University, Princeton, NJ, USA 2School of Physics, University of Sydney, Sydney, Australia 3Initiative for the Theoretical Sciences, The Graduate Center, CUNY, New York, NY, USA. Correspondence to: Chase Goddard <EMAIL>. |
| Pseudocode | No | The paper describes methods and mathematical derivations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/cwgoddard/OOD_ICL |
| Open Datasets | No | Each task is a linear map in d dimensions, w ∈ ℝ^d, and we control task diversity by sampling tasks from hyperspherical caps of varying half-angles. The transformer takes as input a sequence of up to n pairs {x₁, y₁, …, xₙ}, where yᵢ = wᵀxᵢ + εᵢ, with xᵢ ~ N(0, I_d) and εᵢ ~ N(0, σ²). The paper describes how data is generated programmatically but does not provide concrete access to a pre-existing public dataset or a repository for their generated datasets. |
| Dataset Splits | No | Pretraining task distribution: We define a family of task distributions parameterized by ϕ ∈ [0, π] (see Fig 1B). We take S^{d−1}(ϕ) to be a section of the surface of the hypersphere in d dimensions... We then define the task distribution as a uniform distribution on this spherical cap... Test task distribution: We evaluate the performance of the transformer over a family of task distributions parameterized by δ ∈ [0, π] (see Fig 1B)... The test task distribution is then uniform over this set... The paper describes how synthetic data is generated for pretraining and testing based on different distributions, rather than providing fixed splits of a static dataset. |
| Hardware Specification | Yes | All models were trained on a single GPU, either a MIG GPU with 10 GB of memory or an A100 with 40 GB of memory, and took 3 hours to train. |
| Software Dependencies | No | All code was written in Python using the PyTorch library (Paszke et al., 2019). This statement mentions the programming language and a library but does not provide specific version numbers for both key software components. |
| Experiment Setup | Yes | All models were trained for 58,000 steps using a batch size of 128 and a constant learning rate of 3 × 10⁻⁴. All models were converged at the end of training. We use AdamW (Loshchilov & Hutter, 2017) to optimize the MSE. |
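The data-generation procedure quoted in the Open Datasets and Dataset Splits rows can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the cap is sampled by rejection against an assumed pole direction e₁ (the paper does not specify the pole or the sampling method), and the function names are hypothetical.

```python
import numpy as np

def sample_cap_task(d, phi, rng):
    """Sample a unit task vector w uniformly from the hyperspherical cap of
    half-angle phi about the first basis vector e1 (assumed pole direction).
    Rejection sampling: draw uniform on the sphere, keep draws inside the cap.
    Simple but inefficient for small phi; illustrative only."""
    while True:
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        if np.arccos(np.clip(w[0], -1.0, 1.0)) <= phi:
            return w

def make_sequence(w, n, noise_std, rng):
    """One in-context sequence: x_i ~ N(0, I_d), y_i = w^T x_i + eps_i."""
    d = w.shape[0]
    X = rng.standard_normal((n, d))
    y = X @ w + noise_std * rng.standard_normal(n)
    return X, y

rng = np.random.default_rng(0)
w = sample_cap_task(d=8, phi=np.pi / 4, rng=rng)   # task from cap of half-angle pi/4
X, y = make_sequence(w, n=16, noise_std=0.1, rng=rng)
```

Evaluating out-of-distribution generalization then amounts to drawing test tasks with a half-angle δ larger than the pretraining ϕ and measuring regression error on the resulting sequences.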