Selective Induction Heads: How Transformers Select Causal Structures in Context
Authors: Francesco D'Angelo, Francesco Croce, Nicolas Flammarion
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that attention-only transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We conduct a series of experiments to empirically validate our construction and determine whether transformers trained via gradient descent learn it. |
| Researcher Affiliation | Academia | Theory of Machine Learning Lab, EPFL, Lausanne, Switzerland |
| Pseudocode | Yes | Algorithm 1 Generate Dataset of N Sequences from Interleaved Markov Chains |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | No | The framework consists of sequences of length T on a finite alphabet of tokens S, generated by K distinct sources... Algorithm 1 Generate Dataset of N Sequences from Interleaved Markov Chains. The data is synthetically generated according to Algorithm 1, not a pre-existing publicly available dataset with a link or DOI. |
| Dataset Splits | No | At each step, we generate a fresh batch (size 256) of sequences (length 128) via Alg. 1, and train using Adam optimizer with fixed learning rate 0.001 and no weight decay. The paper describes on-the-fly data generation for training rather than static dataset splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running experiments. |
| Software Dependencies | No | We train ... using Adam optimizer with fixed learning rate 0.001 and no weight decay. While an optimizer is mentioned, no specific software or library versions are provided. |
| Experiment Setup | Yes | At each step, we generate a fresh batch (size 256) of sequences (length 128) via Alg. 1, and train using Adam optimizer with fixed learning rate 0.001 and no weight decay. For the standard transformer embedding size we tested 128, 64 and d_QK = 32. For the constructions, we fix β = 100 and λ = 500. |
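
The Pseudocode and Open Datasets rows report that training data is generated on the fly by Algorithm 1 from interleaved Markov chains. The paper's exact procedure is not reproduced here; the following is a minimal sketch under the assumption that each sequence interleaves K independent first-order chains so that position t continues the chain emitting at lag K (i.e. depends on the token at position t − K). All function names and defaults are illustrative, not the authors'.

```python
import numpy as np

def sample_transition_matrix(num_states, rng):
    """Draw a random row-stochastic transition matrix over the token alphabet S."""
    P = rng.random((num_states, num_states))
    return P / P.sum(axis=1, keepdims=True)

def generate_dataset(n_sequences, seq_len=128, num_states=10, num_chains=3, seed=0):
    """Sketch of Alg. 1 (assumed form): each sequence interleaves `num_chains`
    Markov chains, so the token at position t is drawn conditioned on the token
    at lag `num_chains` back, i.e. position t - num_chains."""
    rng = np.random.default_rng(seed)
    data = np.zeros((n_sequences, seq_len), dtype=np.int64)
    for n in range(n_sequences):
        # Assumption: one fresh transition matrix per interleaved source and per sequence.
        Ps = [sample_transition_matrix(num_states, rng) for _ in range(num_chains)]
        # Initial tokens for the first `num_chains` positions are uniform over the alphabet.
        data[n, :num_chains] = rng.integers(num_states, size=num_chains)
        for t in range(num_chains, seq_len):
            k = t % num_chains                # which source emits at position t
            prev = data[n, t - num_chains]    # last token emitted by that source
            data[n, t] = rng.choice(num_states, p=Ps[k][prev])
    return data
```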
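
The Experiment Setup row reports the training protocol: a fresh batch of 256 sequences of length 128 per step, Adam with learning rate 0.001 and no weight decay, and an attention-only transformer with embedding sizes 128 or 64. The sketch below illustrates that loop; the layer count, head count, and the separate d_QK = 32 query/key dimension are assumptions not taken from the paper, and `generate_dataset` refers to the hypothetical sketch above.

```python
import torch
from torch import nn

class AttentionOnlyLM(nn.Module):
    """Minimal attention-only transformer (residual attention layers, no MLP blocks)."""
    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=1, seq_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)
        )
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # Causal mask: each position may only attend to itself and earlier positions.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        for attn in self.layers:
            out, _ = attn(x, x, x, attn_mask=mask)
            x = x + out  # residual stream, no feed-forward block
        return self.unembed(x)

# Training loop mirroring the reported setup: fresh batch each step,
# Adam with lr 1e-3 and no weight decay, next-token cross-entropy loss.
model = AttentionOnlyLM(vocab_size=10, d_model=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
for step in range(1000):
    batch = torch.from_numpy(generate_dataset(256, seq_len=128, num_states=10, seed=step))
    logits = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```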