Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence
Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. |
| Researcher Affiliation | Academia | Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo (The University of Tokyo). Correspondence to: Gouki Minegishi <EMAIL>. |
| Pseudocode | No | The paper describes the network structure and attention computation using equations in Section 3.2 and Appendix B.1, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/gouki510/In-Context-Meta-Learning |
| Open Datasets | Yes | Specifically, we use the SST-2 dataset from the GLUE benchmark, consisting of 872 sentiment-labeled samples. |
| Dataset Splits | No | The paper describes the generation of examples and queries for its In-Context Meta-Learning setting in Section 3.1 and mentions using the SST-2 dataset in a 2-shot setup in Section 6. However, it does not specify explicit training, validation, or test splits (e.g., percentages or counts) for the overall training of the models. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running its experiments. |
| Software Dependencies | No | The paper lists training details such as the optimizer (Vanilla SGD) and loss function (Cross-entropy) in Table 3. However, it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions). |
| Experiment Setup | Yes | Following prior research (Reddy, 2023), we use a two-layer attention-only transformer shown in Figure 1-(b)... The classifier is a two-layer MLP with ReLU activations, followed by a softmax layer producing probabilities over L labels. We train this network to classify the query item xq into one of the L labels using cross-entropy loss. Both the query/key dimension and the MLP hidden layer dimension are set to 128. We use a batch size of 128 and optimize with vanilla stochastic gradient descent at a learning rate of 0.01. We use T = 3, K = 64, L = 32, N = 4, D = 63, ε = 0.1, p_B = 0, unless otherwise specified. |
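The setup row above pins down the architecture and hyperparameters fairly completely. A minimal NumPy sketch of that forward pass is below; it follows only what the table states (two attention-only layers, a two-layer ReLU MLP head over L = 32 labels, query/key and hidden dimensions of 128, N = 4 in-context pairs). All weight initializations, the residual connections, and the sequence layout are illustrative assumptions, not the authors' released code.

```python
import numpy as np

# Hedged sketch of the reported setup: a two-layer attention-only
# transformer followed by a two-layer ReLU MLP classifier.
# Dimensions follow the table (d_qk = d_hidden = 128, L = 32, N = 4);
# weight shapes/inits and residuals are illustrative assumptions.

rng = np.random.default_rng(0)
D_MODEL  = 128          # query/key dimension
D_HID    = 128          # MLP hidden dimension
L_LABELS = 32           # number of labels
SEQ_LEN  = 2 * 4 + 1    # N = 4 (item, label) pairs plus the query item

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_layer(x, Wq, Wk, Wv):
    """Single-head attention-only layer with a residual connection."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(D_MODEL), axis=-1)
    return x + scores @ v

def init(shape):
    return rng.normal(0.0, 0.02, shape)

# Two attention-only layers (Wq, Wk, Wv each).
W1 = [init((D_MODEL, D_MODEL)) for _ in range(3)]
W2 = [init((D_MODEL, D_MODEL)) for _ in range(3)]
# Two-layer ReLU MLP classifier producing probabilities over L labels.
Wm1, Wm2 = init((D_MODEL, D_HID)), init((D_HID, L_LABELS))

def forward(seq):
    h = attn_layer(attn_layer(seq, *W1), *W2)
    hidden = np.maximum(h[-1] @ Wm1, 0.0)  # read out at the query position
    return softmax(hidden @ Wm2)

probs = forward(rng.normal(size=(SEQ_LEN, D_MODEL)))
print(probs.shape)  # (32,); rows of the softmax sum to 1
```

Training would then minimize cross-entropy on the query label with vanilla SGD (batch size 128, learning rate 0.01), per the table.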