Strategy Coopetition Explains the Emergence and Transience of In-Context Learning
Authors: Aaditya K Singh, Ted Moskovitz, Sara Dragutinović, Felix Hill, Stephanie C.Y. Chan, Andrew M Saxe
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we aim to extend the mechanistic understanding of ICL... To do so, we reproduce and investigate the key transience result in a simplified synthetic data setting with a 2-layer attention-only transformer. Using behavioral evaluators, we find the asymptotic strategy after the disappearance of ICL is not pure in-weights learning. Rather, it is a surprising hybrid strategy that we term context-constrained in-weights learning (CIWL, Section 4). Figure 1b shows a reproduction of the key transience phenomena in our simplified setting, with an extended figure in the appendix (Figure 11). |
| Researcher Affiliation | Collaboration | 1 Gatsby Computational Neuroscience Unit, University College London; 2 Anthropic AI, work completed while at the Gatsby Unit, UCL; 3 University of Oxford; 4 Google DeepMind. Correspondence to: Aaditya K. Singh <EMAIL>. |
| Pseudocode | No | The paper includes a mathematical model in Section 6, but it is not presented as structured pseudocode or an algorithm block. It describes loss functions and dynamics without explicit step-by-step algorithmic procedures. |
| Open Source Code | Yes | All code is open-sourced at https://github.com/aadityasingh/icl-dynamics. |
| Open Datasets | Yes | Our few-shot learning task consists of sequences of exemplar-label pairs, where image exemplars are drawn from the Omniglot dataset of handwritten characters (Lake et al., 2015). Images were embedded using a ResNet-18 encoder that was pretrained on ImageNet (He et al., 2015; Russakovsky et al., 2015). |
| Dataset Splits | No | While the original Omniglot dataset has 1623 classes, we follow prior work (Chan et al., 2022) and augment it to 12984 classes by applying flips and rotations (8 variants per class, so 1623 × 8 = 12984). Of these, we use a random 12800 for training. In Appendix B.3, we also considered using different numbers of classes or exemplars, observing similar modulations to Singh et al. (2023) for the duration, timing, and magnitude of the transience effect. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications. It only mentions that models were trained in JAX. |
| Software Dependencies | No | All models were trained in JAX (Bradbury et al., 2018). The paper mentions JAX but does not specify a version number for it or any other software dependencies such as Python, specific deep learning frameworks, or operating systems. |
| Experiment Setup | Yes | We train 2-layer attention-only transformers (Vaswani et al., 2017; Elhage et al., 2021) on a synthetic few-shot learning task. We use d_model = 64, with 8 heads per layer and learned absolute positional embeddings. As is common in mechanistic work (Olsson et al., 2022; Singh et al., 2024), we chose this minimal setting as it sufficed to reproduce key phenomena. We used the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a learning rate of 10⁻⁵, and a batch size of 32 sequences. |
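The class-count arithmetic in the Dataset Splits row (1623 base classes augmented to 12984, of which a random 12800 are used for training) can be sketched in pure Python. The decomposition of the 8 variants per class into 4 rotations × 2 flip states, the seed, and the tuple-based class encoding are assumptions for illustration, not taken from the paper:

```python
import random

BASE_CLASSES = 1623  # Omniglot character classes (Lake et al., 2015)

# Assumed augmentation scheme: 8 variants per class
# (4 rotations x 2 flip states), giving 1623 * 8 = 12984 classes.
rotations = [0, 90, 180, 270]
flips = [False, True]
augmented = [
    (cls, rot, flip)
    for cls in range(BASE_CLASSES)
    for rot in rotations
    for flip in flips
]
assert len(augmented) == 12984

# Random 12800 of the augmented classes are used for training;
# the seed here is hypothetical.
rng = random.Random(0)
train_classes = rng.sample(augmented, 12800)
holdout = set(augmented) - set(train_classes)  # remaining 184 classes
```

This makes the split explicit: 12984 − 12800 leaves 184 augmented classes outside the training set.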
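The paper trains in JAX, but the optimizer hyperparameters quoted in the Experiment Setup row (Adam with β1 = 0.9, β2 = 0.999, learning rate 10⁻⁵) can be illustrated with a minimal pure-Python sketch of a single Adam update (Kingma & Ba, 2015). The function name and scalar-list parameter representation are illustrative, not the paper's implementation:

```python
import math

def adam_step(params, grads, state, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over a list of scalar parameters.

    state is (first_moments, second_moments, step_count).
    """
    m, v, t = state
    t += 1
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = b1 * mi + (1 - b1) * g       # EMA of gradients
        vi = b2 * vi + (1 - b2) * g * g   # EMA of squared gradients
        m_hat = mi / (1 - b1 ** t)        # bias correction
        v_hat = vi / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, (new_m, new_v, t)

# Usage: one step with a unit gradient moves the parameter by ~lr,
# since bias correction makes m_hat = v_hat = 1 on the first step.
params = [0.5]
state = ([0.0], [0.0], 0)
params, state = adam_step(params, [1.0], state)
```

A practical JAX training run would instead use an off-the-shelf optimizer (e.g. Optax's Adam) with these same hyperparameters; the sketch only makes the update rule concrete.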