Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training."
Researcher Affiliation | Academia | Gouki Minegishi¹, Hiroki Furuta¹, Shohei Taniguchi¹, Yusuke Iwasawa¹, Yutaka Matsuo¹. ¹The University of Tokyo. Correspondence to: Gouki Minegishi <EMAIL>.
Pseudocode | No | The paper describes the network structure and attention computation using equations in Section 3.2 and Appendix B.1, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/gouki510/In-Context-Meta-Learning
Open Datasets | Yes | "Specifically, we use the SST-2 dataset from the GLUE benchmark, consisting of 872 sentiment-labeled samples."
Dataset Splits | No | The paper describes the generation of examples and queries for its In-Context Meta-Learning setting in Section 3.1 and mentions using the SST-2 dataset in a 2-shot setup in Section 6. However, it does not specify explicit training, validation, or test splits (e.g., percentages or counts) for model training.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for its experiments.
Software Dependencies | No | The paper lists training details such as the optimizer (vanilla SGD) and loss function (cross-entropy) in Table 3. However, it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "Following prior research (Reddy, 2023), we use a two-layer attention-only transformer shown in Figure 1-(b)... The classifier is a two-layer MLP with ReLU activations, followed by a softmax layer producing probabilities over L labels. We train this network to classify the query item xq into one of the L labels using cross-entropy loss. Both the query/key dimension and the MLP hidden layer dimension are set to 128. We use a batch size of 128 and optimize with vanilla stochastic gradient descent at a learning rate of 0.01. We use T = 3, K = 64, L = 32, N = 4, D = 63, ε = 0.1, p_B = 0, unless otherwise specified."
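The quoted setup can be sketched as a forward pass. This is a minimal NumPy sketch, not the authors' code: the residual connections, the causal mask, and the reading of N as the number of context exemplars and D as the input item dimensionality are our assumptions about the Reddy (2023)-style architecture; only the widths (query/key and MLP hidden dims of 128, L = 32 labels) and the two-layer ReLU MLP classifier are stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters quoted in the paper (roles of N and D are assumptions).
D_MODEL = 128   # query/key dimension and MLP hidden width (stated)
L = 32          # number of labels (stated)
N = 4           # context exemplars preceding the query (assumed role)
D = 63          # input item dimensionality (assumed role)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attn_layer(h, wq, wk, wv):
    """Single-head softmax attention with a causal mask (assumed)."""
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(wq.shape[1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ v


# Toy random weights; the paper trains with vanilla SGD at lr 0.01.
seq_len = N + 1                                  # N exemplars + 1 query
x = rng.normal(size=(seq_len, D))                # one input sequence
w_embed = rng.normal(size=(D, D_MODEL)) * 0.02

h = x @ w_embed
wq1, wk1, wv1 = (rng.normal(size=(D_MODEL, D_MODEL)) * 0.02 for _ in range(3))
h = h + attn_layer(h, wq1, wk1, wv1)             # attention layer 1
wq2, wk2, wv2 = (rng.normal(size=(D_MODEL, D_MODEL)) * 0.02 for _ in range(3))
h = h + attn_layer(h, wq2, wk2, wv2)             # attention layer 2

# Two-layer ReLU MLP classifier on the query position, softmax over L labels.
w1 = rng.normal(size=(D_MODEL, D_MODEL)) * 0.02
w2 = rng.normal(size=(D_MODEL, L)) * 0.02
probs = softmax(np.maximum(h[-1] @ w1, 0.0) @ w2)
```

In training, `probs` would be compared against the query's label with cross-entropy loss over batches of 128 sequences.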