Strategy Coopetition Explains the Emergence and Transience of In-Context Learning
Authors: Aaditya K Singh, Ted Moskovitz, Sara Dragutinović, Felix Hill, Stephanie C.Y. Chan, Andrew M Saxe
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we aim to extend the mechanistic understanding of ICL... To do so, we reproduce and investigate the key transience result in a simplified synthetic data setting with a 2-layer attention-only transformer. Using behavioral evaluators, we find the asymptotic strategy after the disappearance of ICL is not pure in-weights learning. Rather, it is a surprising hybrid strategy that we term context-constrained in-weights learning (CIWL, Section 4). Figure 1b shows a reproduction of the key transience phenomena in our simplified setting, with an extended figure in the appendix (Figure 11). |
| Researcher Affiliation | Collaboration | 1 Gatsby Computational Neuroscience Unit, University College London; 2 Anthropic AI, work completed while at the Gatsby Unit, UCL; 3 University of Oxford; 4 Google DeepMind. Correspondence to: Aaditya K. Singh <EMAIL>. |
| Pseudocode | No | The paper includes a mathematical model in Section 6, but it is not presented as structured pseudocode or an algorithm block. It describes loss functions and dynamics without explicit step-by-step algorithmic procedures. |
| Open Source Code | Yes | All code is open-sourced at https://github.com/aadityasingh/icl-dynamics. |
| Open Datasets | Yes | Our few-shot learning task consists of sequences of exemplar-label pairs, where image exemplars are drawn from the Omniglot dataset of handwritten characters (Lake et al., 2015). Images were embedded using a ResNet-18 encoder that was pretrained on ImageNet (He et al., 2015; Russakovsky et al., 2015). |
| Dataset Splits | No | While the original Omniglot dataset has 1623 classes, we follow prior work (Chan et al., 2022) and augment it to 12984 classes by applying flips and rotations (8 variants per class, so 1623 × 8 = 12984). Of these, we use a random 12800 for training. In Appendix B.3, we also considered using different numbers of classes or exemplars, observing similar modulations to Singh et al. (2023) for the duration, timing, and magnitude of the transience effect. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications. It only mentions that models were trained in JAX. |
| Software Dependencies | No | All models were trained in JAX (Bradbury et al., 2018). The paper mentions JAX but does not specify a version number for it or any other software dependencies such as Python, specific deep learning frameworks, or operating systems. |
| Experiment Setup | Yes | We train 2-layer attention-only transformers (Vaswani et al., 2017; Elhage et al., 2021) on a synthetic few-shot learning task. We use d_model = 64, with 8 heads per layer and learned absolute positional embeddings. As is common in mechanistic work (Olsson et al., 2022; Singh et al., 2024), we chose this minimal setting as it sufficed to reproduce key phenomena. We used the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a learning rate of 10⁻⁵, and a batch size of 32 sequences. |
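The class-count arithmetic in the Dataset Splits row (1623 base classes augmented to 12984, of which a random 12800 are used for training) can be sketched in pure Python. The decomposition of the 8 variants per class into 4 rotations × 2 flip states, the seed, and the tuple-based class encoding are assumptions for illustration, not taken from the paper:

```python
import random

BASE_CLASSES = 1623  # Omniglot character classes (Lake et al., 2015)

# Assumed augmentation scheme: 8 variants per class
# (4 rotations x 2 flip states), giving 1623 * 8 = 12984 classes.
rotations = [0, 90, 180, 270]
flips = [False, True]
augmented = [
    (cls, rot, flip)
    for cls in range(BASE_CLASSES)
    for rot in rotations
    for flip in flips
]
assert len(augmented) == 12984

# Random 12800 of the augmented classes are used for training;
# the seed here is hypothetical.
rng = random.Random(0)
train_classes = rng.sample(augmented, 12800)
holdout = set(augmented) - set(train_classes)  # remaining 184 classes
```

This makes the split explicit: 12984 − 12800 leaves 184 augmented classes outside the training set.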
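The paper trains in JAX, but the optimizer hyperparameters quoted in the Experiment Setup row (Adam with β1 = 0.9, β2 = 0.999, learning rate 10⁻⁵) can be illustrated with a minimal pure-Python sketch of a single Adam update (Kingma & Ba, 2015). The function name and scalar-list parameter representation are illustrative, not the paper's implementation:

```python
import math

def adam_step(params, grads, state, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over a list of scalar parameters.

    state is (first_moments, second_moments, step_count).
    """
    m, v, t = state
    t += 1
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = b1 * mi + (1 - b1) * g       # EMA of gradients
        vi = b2 * vi + (1 - b2) * g * g   # EMA of squared gradients
        m_hat = mi / (1 - b1 ** t)        # bias correction
        v_hat = vi / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, (new_m, new_v, t)

# Usage: one step with a unit gradient moves the parameter by ~lr,
# since bias correction makes m_hat = v_hat = 1 on the first step.
params = [0.5]
state = ([0.0], [0.0], 0)
params, state = adam_step(params, [1.0], state)
```

A practical JAX training run would instead use an off-the-shelf optimizer (e.g. Optax's Adam) with these same hyperparameters; the sketch only makes the update rule concrete.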