Selective Induction Heads: How Transformers Select Causal Structures in Context

Authors: Francesco D'Angelo, Francesco Croce, Nicolas Flammarion

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically demonstrate that attention-only transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We conduct a series of experiments to empirically validate our construction and determine whether transformers trained via gradient descent learn it."
Researcher Affiliation | Academia | Theory of Machine Learning Lab, EPFL, Lausanne, Switzerland
Pseudocode | Yes | "Algorithm 1: Generate Dataset of N Sequences from Interleaved Markov Chains"
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | No | "The framework consists of sequences of length T on a finite alphabet of tokens S, generated by K distinct sources..." The data is synthetically generated according to Algorithm 1, not drawn from a pre-existing publicly available dataset with a link or DOI.
Dataset Splits | No | "At each step, we generate a fresh batch (size 256) of sequences (length 128) via Alg. 1, and train using Adam optimizer with fixed learning rate 0.001 and no weight decay." The paper describes on-the-fly data generation for training rather than static dataset splits.
Hardware Specification | No | The paper does not specify the hardware (GPU or CPU models, etc.) used to run the experiments.
Software Dependencies | No | "We train ... using Adam optimizer with fixed learning rate 0.001 and no weight decay." While an optimizer is mentioned, no specific software or library versions are provided.
Experiment Setup | Yes | "At each step, we generate a fresh batch (size 256) of sequences (length 128) via Alg. 1, and train using Adam optimizer with fixed learning rate 0.001 and no weight decay. For the standard transformer embedding size we tested 128, 64 and d_QK = 32. For the constructions, we fix β = 100 and λ = 500."
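The table cites Algorithm 1 (interleaved Markov chains over K sources, sequences of length T on an alphabet of S tokens) only by name. A minimal sketch of such a generator, assuming round-robin interleaving of the K chains (so the token at position t depends on the token at lag K from the same source) and random per-source transition matrices, might look like the following; `generate_batch` and all its defaults are illustrative, not taken from the paper:

```python
import numpy as np

def generate_batch(N=256, T=128, K=3, S=5, seed=0):
    """Sketch of an Algorithm-1-style generator: K Markov chains with
    random row-stochastic transition matrices emit tokens round-robin,
    so the token at position t depends on the token at position t - K."""
    rng = np.random.default_rng(seed)
    data = np.zeros((N, T), dtype=np.int64)
    for n in range(N):
        # one random S x S transition matrix per source (rows sum to 1)
        P = rng.dirichlet(np.ones(S), size=(K, S))  # shape (K, S, S)
        seq = list(rng.integers(0, S, size=K))      # initial token of each chain
        for t in range(K, T):
            prev = seq[t - K]                       # same-source token at lag K
            seq.append(rng.choice(S, p=P[t % K, prev]))
        data[n] = seq
    return data

batch = generate_batch()
print(batch.shape)  # (256, 128)
```

With this interleaving, predicting the next token correctly requires attending back exactly K positions, which matches the paper's description of identifying the correct lag and copying the corresponding past token.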
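The Experiment Setup row pins down the optimizer (Adam, learning rate 0.001, no weight decay) and the fresh-batch regime, but not the loop itself. A hedged PyTorch sketch under those settings could look like this; `model` (any module mapping token ids to next-token logits) and the `generate_batch` callable are placeholders, not artifacts from the paper:

```python
import torch
import torch.nn.functional as F

def train(model, generate_batch, steps=1000, device="cpu"):
    """Illustrative loop matching the reported setup: a fresh batch of
    shape (256, 128) is drawn at every step, and the model is trained
    with next-token cross-entropy under Adam (lr 1e-3, no weight decay)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
    for _ in range(steps):
        x = torch.as_tensor(generate_batch(), device=device)  # (B, T) token ids
        logits = model(x[:, :-1])            # predict token t+1 from prefix
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Because every step sees freshly generated sequences, there is no train/validation/test split, consistent with the "Dataset Splits: No" assessment above.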