Toward Understanding In-context vs. In-weight Learning

Authors: Bryan Chan, Xinyi Chen, Andras Gyorgy, Dale Schuurmans

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that training transformers on synthetic and Omniglot (Lake et al., 2015) data drawn from the stylized distribution in our theoretical model follows the predictions of the developed theory. We further provide examples where ICL is persistent, or where ICL and IWL are present in parallel when both can achieve the same performance, which suggests that the transience of ICL is more complicated in this case and might depend on its finite-time performance (which is much harder to analyze experimentally). Finally, to bridge the gap from theory to practice, we show that fine-tuning an LLM (Gemini Nano v1; Gemini Team, Google, 2023) to memorize certain data can result in a reduction of its ICL ability.
Researcher Affiliation | Collaboration | Bryan Chan¹, Xinyi Chen², András György², Dale Schuurmans¹,² — ¹University of Alberta, ²Google DeepMind
Pseudocode | Yes | Algorithm 1 Bi-level update
Input: horizon N; no-regret learners A_α for α, A_g for g, and A_h for h.
for t = 1 to N do
    Receive example x_t; predict with f_t(x_t; α_t) = α_t(x_t) g(x_t; w_t) + (1 − α_t(x_t)) h(x_t; u_t)
    Observe losses ℓ_t(f_t(x_t)), ℓ_t(g(x_t; w_t)), ℓ_t(h(x_t; u_t))
    Update w_{t+1} = A_g(ℓ_1(g(x_1; w_1)), . . . , ℓ_t(g(x_t; w_t)))
    Update u_{t+1} = A_h(ℓ_1(h(x_1; u_1)), . . . , ℓ_t(h(x_t; u_t)))
    Define m_t(α_t) = α_t(x_t)(ℓ_t(g(x_t; w_t)) − ℓ_t(h(x_t; u_t)))
    Update α_{t+1} = A_α(m_1, . . . , m_t)
end for
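The bi-level update above can be sketched concretely in Python. This is a minimal illustration only: the choices of linear predictors for g and h, squared loss, and projected online gradient descent as the no-regret learners A_g, A_h, and A_α are our assumptions, not the paper's specification.

```python
import numpy as np

def bilevel_update(xs, ys, lr=0.1):
    """One pass of the bi-level update with linear g, h and squared loss.
    A_g, A_h, A_alpha are all instantiated here as projected online
    gradient descent, a standard no-regret learner (illustrative choice)."""
    d = xs.shape[1]
    w = np.zeros(d)   # parameters of g
    u = np.zeros(d)   # parameters of h
    alpha = 0.5       # mixing weight alpha_t, projected back to [0, 1]
    for x, y in zip(xs, ys):
        g, h = w @ x, u @ x
        f = alpha * g + (1 - alpha) * h          # combined prediction f_t
        loss_g = (g - y) ** 2                    # ell_t(g(x_t; w_t))
        loss_h = (h - y) ** 2                    # ell_t(h(x_t; u_t))
        # A_g, A_h: gradient steps on their own losses
        w = w - lr * 2 * (g - y) * x
        u = u - lr * 2 * (h - y) * x
        # A_alpha: gradient step on m_t(alpha) = alpha * (loss_g - loss_h)
        alpha = float(np.clip(alpha - lr * (loss_g - loss_h), 0.0, 1.0))
    return w, u, alpha
```

With this instantiation, α drifts toward whichever of g or h accumulates lower loss, matching the intuition that the gate selects the better sub-predictor over time.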
Open Source Code | Yes | Our code is available here: https://github.com/chanb/icl_vs_iwl.
Open Datasets | Yes | We demonstrate empirically that training transformers on synthetic and Omniglot (Lake et al., 2015) data drawn from the stylized distribution in our theoretical model follows the predictions of the developed theory.
Dataset Splits | Yes | We have access to a training dataset S of N examples, where in addition to the usual (input, label) pairs in a classification problem, each example has a context... during training the task is to predict y_i given x_i, while the final goal is to minimize the prediction error for a new sample (x, y) sampled independently from D, where x = (x_1, y_1, x_2, y_2, . . . , x_L, y_L, x). ... To evaluate whether the trained models exhibit IWL or ICL, we evaluate the trained models on in-base distribution (IBD) and out-of-base distribution (OOBD) data.
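One common way to separate the two behaviors in evaluation is to pair each query with a context example whose label has been deliberately remapped, so that IWL and ICL give different answers. The probe below is an illustrative stand-in of ours, not the paper's exact IBD/OOBD protocol; the `predict` interface and remapping rule are assumptions.

```python
import numpy as np

def icl_vs_iwl_probe(predict, mus, rng, sigma=0.2, n_eval=256):
    """Illustrative probe: give the model a context example drawn from the
    query's class but tagged with a remapped label. A model relying on IWL
    should output the query's base-distribution label; a model relying on
    ICL should copy the remapped context label."""
    n_classes, d = mus.shape
    base_correct = ctx_correct = 0
    for _ in range(n_eval):
        y = int(rng.integers(0, n_classes))
        y_ctx = (y + 1) % n_classes                  # remapped context label
        x = mus[y] + sigma * rng.standard_normal(d)  # query input
        cx = mus[y] + sigma * rng.standard_normal(d) # context input, same class
        pred = predict(cx, y_ctx, x)
        base_correct += (pred == y)
        ctx_correct += (pred == y_ctx)
    return base_correct / n_eval, ctx_correct / n_eval
```

A pure in-context copier scores 1.0 on the context-label accuracy and 0.0 on the base-label accuracy; a pure in-weight memorizer does the opposite.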
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It only describes model architectures and training parameters.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). It mentions models like Transformer (GPT), ResNet, Gemini Nano v1, and GRU, but not the specific software implementations or versions used.
Experiment Setup | Yes | Experimental setup. We conduct experiments by training a transformer (GPT) end-to-end (Radford et al., 2018). The models consist of two transformer decoder layers, each with a single attention head, processing 64-dimensional embeddings. Both the input and output tokenizers are linear projections, where the former is the identity matrix. For prediction we take the last token output from the last transformer block and feed it into a linear layer followed by a softmax. To probe the difficulty of IWL and ICL, we separately train two transformers for these two settings by using data with p_relevant = 0.0 and p_relevant = 1.0; we refer to these models as the IW and IC predictors, respectively. These transformers only act as proxies for measuring whether it is possible to perform ICL and IWL in the idealized case. A generic model trained on data with p_relevant = 0.9 is referred to as the transformer. Unless specified otherwise, all models are trained using cross-entropy loss for 50K gradient updates with batch size of 32. We set C_H = {0, . . . , 4} and C_L = {5, . . . , 9}, and set the input dimension d = 64 with Σ = σ²I_d, where σ = 0.2 and I_d is the d × d identity matrix. Furthermore, we fix the context length L = 1, the probability of sampling high-frequency classes p_high = 0.9, and vary the total number of samples N ∈ {2^6, 2^8, . . . , 2^20}.
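The sampling distribution described in the setup (Gaussian class clusters, high-frequency classes drawn with probability p_high, contexts relevant with probability p_relevant) can be sketched as follows. This is a sketch under our own assumptions: the class means `mus` and the way a context label is encoded as a token are illustrative, not the paper's implementation.

```python
import numpy as np

def sample_batch(rng, mus, batch_size=32, L=1, p_high=0.9, p_relevant=0.9,
                 sigma=0.2, n_high=5):
    """Sketch of the described training distribution. Classes are Gaussian
    clusters around the rows of mus (sigma^2 * I covariance). High-frequency
    classes C_H = {0, ..., n_high-1} are sampled with probability p_high;
    with probability p_relevant each context example shares the query's
    class, otherwise its class is drawn uniformly at random."""
    n_classes, d = mus.shape
    seqs, labels = [], []
    for _ in range(batch_size):
        if rng.random() < p_high:
            y = int(rng.integers(0, n_high))          # high-frequency class
        else:
            y = int(rng.integers(n_high, n_classes))  # low-frequency class
        query = mus[y] + sigma * rng.standard_normal(d)
        ctx = []
        for _ in range(L):
            if rng.random() < p_relevant:
                cy = y                                   # relevant context
            else:
                cy = int(rng.integers(0, n_classes))     # irrelevant context
            cx = mus[cy] + sigma * rng.standard_normal(d)
            # crude label token: label value broadcast to d dims (assumption)
            ctx.extend([cx, np.full(d, float(cy))])
        seqs.append(np.stack(ctx + [query]))             # (2L+1, d) sequence
        labels.append(y)
    return np.stack(seqs), np.array(labels)
```

With the paper's defaults (L = 1), each sampled sequence has shape (3, d): one context input, one context label token, and the query.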