Toward Understanding In-context vs. In-weight Learning

Authors: Bryan Chan, Xinyi Chen, Andras Gyorgy, Dale Schuurmans

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that training transformers on synthetic and Omniglot (Lake et al., 2015) data drawn from the stylized distribution in our theoretical model follows the predictions of the developed theory. We further provide examples where ICL is persistent, or where ICL and IWL are present in parallel when both can achieve the same performance, which suggests that the transience of ICL is more complicated in this case and might depend on its finite-time performance (which is much harder to analyze experimentally). Finally, to bridge the gap from theory to practice, we show that fine-tuning an LLM (Gemini Nano v1; Gemini Team, Google, 2023) to memorize certain data can result in a reduction of its ICL ability.
Researcher Affiliation | Collaboration | Bryan Chan¹, Xinyi Chen², András György², Dale Schuurmans¹,² — ¹University of Alberta, ²Google DeepMind
Pseudocode | Yes | Algorithm 1 Bi-level update
Input: horizon N; no-regret learners A_α for α, A_g for g, and A_h for h.
for t = 1 to N do
    Receive example x_t; predict with f_t(x_t; α_t) = α_t(x_t) g(x_t; w_t) + (1 − α_t(x_t)) h(x_t; u_t)
    Observe losses ℓ_t(f_t(x_t)), ℓ_t(g(x_t; w_t)), ℓ_t(h(x_t; u_t))
    Update w_{t+1} = A_g(ℓ_1(g(x_1; w_1)), . . . , ℓ_t(g(x_t; w_t)))
    Update u_{t+1} = A_h(ℓ_1(h(x_1; u_1)), . . . , ℓ_t(h(x_t; u_t)))
    Define m_t(α_t) = α_t(x_t)(ℓ_t(g(x_t; w_t)) − ℓ_t(h(x_t; u_t)))
    Update α_{t+1} = A_α(m_1, . . . , m_t)
end for
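The bi-level update above can be sketched concretely in Python. This is a minimal illustration only: the choices of linear predictors for g and h, squared loss, and projected online gradient descent as the no-regret learners A_g, A_h, and A_α are our assumptions, not the paper's specification.

```python
import numpy as np

def bilevel_update(xs, ys, lr=0.1):
    """One pass of the bi-level update with linear g, h and squared loss.
    A_g, A_h, A_alpha are all instantiated here as projected online
    gradient descent, a standard no-regret learner (illustrative choice)."""
    d = xs.shape[1]
    w = np.zeros(d)   # parameters of g
    u = np.zeros(d)   # parameters of h
    alpha = 0.5       # mixing weight alpha_t, projected back to [0, 1]
    for x, y in zip(xs, ys):
        g, h = w @ x, u @ x
        f = alpha * g + (1 - alpha) * h          # combined prediction f_t
        loss_g = (g - y) ** 2                    # ell_t(g(x_t; w_t))
        loss_h = (h - y) ** 2                    # ell_t(h(x_t; u_t))
        # A_g, A_h: gradient steps on their own losses
        w = w - lr * 2 * (g - y) * x
        u = u - lr * 2 * (h - y) * x
        # A_alpha: gradient step on m_t(alpha) = alpha * (loss_g - loss_h)
        alpha = float(np.clip(alpha - lr * (loss_g - loss_h), 0.0, 1.0))
    return w, u, alpha
```

With this instantiation, α drifts toward whichever of g or h accumulates lower loss, matching the intuition that the gate selects the better sub-predictor over time.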
Open Source Code | Yes | Our code is available here: https://github.com/chanb/icl_vs_iwl.
Open Datasets | Yes | We demonstrate empirically that training transformers on synthetic and Omniglot (Lake et al., 2015) data drawn from the stylized distribution in our theoretical model follows the predictions of the developed theory.
Dataset Splits | Yes | We have access to a training dataset S of N examples, where in addition to the usual (input, label) pairs in a classification problem, each example has a context... during training the task is to predict y_i given x_i, while the final goal is to minimize the prediction error for a new sample (x, y) sampled independently from D, where x = (x_1, y_1, x_2, y_2, . . . , x_L, y_L, x). ... To evaluate whether the trained models exhibit IWL or ICL, we evaluate the trained models on in-base distribution (IBD) and out-of-base distribution (OOBD) data.
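One common way to separate the two behaviors in evaluation is to pair each query with a context example whose label has been deliberately remapped, so that IWL and ICL give different answers. The probe below is an illustrative stand-in of ours, not the paper's exact IBD/OOBD protocol; the `predict` interface and remapping rule are assumptions.

```python
import numpy as np

def icl_vs_iwl_probe(predict, mus, rng, sigma=0.2, n_eval=256):
    """Illustrative probe: give the model a context example drawn from the
    query's class but tagged with a remapped label. A model relying on IWL
    should output the query's base-distribution label; a model relying on
    ICL should copy the remapped context label."""
    n_classes, d = mus.shape
    base_correct = ctx_correct = 0
    for _ in range(n_eval):
        y = int(rng.integers(0, n_classes))
        y_ctx = (y + 1) % n_classes                  # remapped context label
        x = mus[y] + sigma * rng.standard_normal(d)  # query input
        cx = mus[y] + sigma * rng.standard_normal(d) # context input, same class
        pred = predict(cx, y_ctx, x)
        base_correct += (pred == y)
        ctx_correct += (pred == y_ctx)
    return base_correct / n_eval, ctx_correct / n_eval
```

A pure in-context copier scores 1.0 on the context-label accuracy and 0.0 on the base-label accuracy; a pure in-weight memorizer does the opposite.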
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It only describes model architectures and training parameters.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). It mentions models like Transformer (GPT), ResNet, Gemini Nano v1, and GRU, but not the specific software implementations or versions used.
Experiment Setup | Yes | Experimental setup. We conduct experiments by training a transformer (GPT) end-to-end (Radford et al., 2018). The models consist of two transformer decoder layers, each with a single attention head, processing 64-dimensional embeddings. Both the input and output tokenizers are linear projections, where the former is the identity matrix. For prediction we take the last token output from the last transformer block and feed it into a linear layer followed by a softmax. To probe the difficulty of IWL and ICL, we separately train two transformers for these two settings by using data with p_relevant = 0.0 and p_relevant = 1.0; we refer to these models as the IW and IC predictors, respectively. These transformers only act as proxies for measuring whether it is possible to perform ICL and IWL in the idealized case. A generic model trained on data with p_relevant = 0.9 is referred to as the transformer. Unless specified otherwise, all models are trained using cross-entropy loss for 50K gradient updates with batch size of 32. We set C_H = {0, . . . , 4} and C_L = {5, . . . , 9}, and set the input dimension d = 64 with Σ = σ²I_d, where σ = 0.2 and I_d is the d × d identity matrix. Furthermore, we fix the context length L = 1, the probability of sampling high-frequency classes p_high = 0.9, and vary the total number of samples N ∈ {2^6, 2^8, . . . , 2^20}.
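The sampling distribution described in the setup (Gaussian class clusters, high-frequency classes drawn with probability p_high, contexts relevant with probability p_relevant) can be sketched as follows. This is a sketch under our own assumptions: the class means `mus` and the way a context label is encoded as a token are illustrative, not the paper's implementation.

```python
import numpy as np

def sample_batch(rng, mus, batch_size=32, L=1, p_high=0.9, p_relevant=0.9,
                 sigma=0.2, n_high=5):
    """Sketch of the described training distribution. Classes are Gaussian
    clusters around the rows of mus (sigma^2 * I covariance). High-frequency
    classes C_H = {0, ..., n_high-1} are sampled with probability p_high;
    with probability p_relevant each context example shares the query's
    class, otherwise its class is drawn uniformly at random."""
    n_classes, d = mus.shape
    seqs, labels = [], []
    for _ in range(batch_size):
        if rng.random() < p_high:
            y = int(rng.integers(0, n_high))          # high-frequency class
        else:
            y = int(rng.integers(n_high, n_classes))  # low-frequency class
        query = mus[y] + sigma * rng.standard_normal(d)
        ctx = []
        for _ in range(L):
            if rng.random() < p_relevant:
                cy = y                                   # relevant context
            else:
                cy = int(rng.integers(0, n_classes))     # irrelevant context
            cx = mus[cy] + sigma * rng.standard_normal(d)
            # crude label token: label value broadcast to d dims (assumption)
            ctx.extend([cx, np.full(d, float(cy))])
        seqs.append(np.stack(ctx + [query]))             # (2L+1, d) sequence
        labels.append(y)
    return np.stack(seqs), np.array(labels)
```

With the paper's defaults (L = 1), each sampled sequence has shape (3, d): one context input, one context label token, and the query.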