Selective Induction Heads: How Transformers Select Causal Structures in Context
Authors: Francesco D'Angelo, Francesco Croce, Nicolas Flammarion
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that attention-only transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We conduct a series of experiments to empirically validate our construction and determine whether transformers trained via gradient descent learn it. |
| Researcher Affiliation | Academia | Theory of Machine Learning Lab, EPFL, Lausanne, Switzerland |
| Pseudocode | Yes | Algorithm 1 Generate Dataset of N Sequences from Interleaved Markov Chains |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | No | The framework consists of sequences of length T on a finite alphabet of tokens S, generated by K distinct sources... Algorithm 1 Generate Dataset of N Sequences from Interleaved Markov Chains. The data is synthetically generated according to Algorithm 1, not a pre-existing publicly available dataset with a link or DOI. |
| Dataset Splits | No | At each step, we generate a fresh batch (size 256) of sequences (length 128) via Alg. 1, and train using Adam optimizer with fixed learning rate 0.001 and no weight decay. The paper describes on-the-fly data generation for training rather than static dataset splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running experiments. |
| Software Dependencies | No | We train ... using Adam optimizer with fixed learning rate 0.001 and no weight decay. While an optimizer is mentioned, no specific software or library versions are provided. |
| Experiment Setup | Yes | At each step, we generate a fresh batch (size 256) of sequences (length 128) via Alg. 1, and train using Adam optimizer with fixed learning rate 0.001 and no weight decay. For the standard transformer embedding size we tested 128, 64 and d_QK = 32. For the constructions, we fix β = 100 and λ = 500. |
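
The Pseudocode and Open Datasets rows report that training data is generated on the fly by Algorithm 1 from interleaved Markov chains. The paper's exact procedure is not reproduced here; the following is a minimal sketch under the assumption that each sequence interleaves K independent first-order chains so that position t continues the chain emitting at lag K (i.e. depends on the token at position t − K). All function names and defaults are illustrative, not the authors'.

```python
import numpy as np

def sample_transition_matrix(num_states, rng):
    """Draw a random row-stochastic transition matrix over the token alphabet S."""
    P = rng.random((num_states, num_states))
    return P / P.sum(axis=1, keepdims=True)

def generate_dataset(n_sequences, seq_len=128, num_states=10, num_chains=3, seed=0):
    """Sketch of Alg. 1 (assumed form): each sequence interleaves `num_chains`
    Markov chains, so the token at position t is drawn conditioned on the token
    at lag `num_chains` back, i.e. position t - num_chains."""
    rng = np.random.default_rng(seed)
    data = np.zeros((n_sequences, seq_len), dtype=np.int64)
    for n in range(n_sequences):
        # Assumption: one fresh transition matrix per interleaved source and per sequence.
        Ps = [sample_transition_matrix(num_states, rng) for _ in range(num_chains)]
        # Initial tokens for the first `num_chains` positions are uniform over the alphabet.
        data[n, :num_chains] = rng.integers(num_states, size=num_chains)
        for t in range(num_chains, seq_len):
            k = t % num_chains                # which source emits at position t
            prev = data[n, t - num_chains]    # last token emitted by that source
            data[n, t] = rng.choice(num_states, p=Ps[k][prev])
    return data
```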
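
The Experiment Setup row reports the training protocol: a fresh batch of 256 sequences of length 128 per step, Adam with learning rate 0.001 and no weight decay, and an attention-only transformer with embedding sizes 128 or 64. The sketch below illustrates that loop; the layer count, head count, and the separate d_QK = 32 query/key dimension are assumptions not taken from the paper, and `generate_dataset` refers to the hypothetical sketch above.

```python
import torch
from torch import nn

class AttentionOnlyLM(nn.Module):
    """Minimal attention-only transformer (residual attention layers, no MLP blocks)."""
    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=1, seq_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)
        )
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # Causal mask: each position may only attend to itself and earlier positions.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        for attn in self.layers:
            out, _ = attn(x, x, x, attn_mask=mask)
            x = x + out  # residual stream, no feed-forward block
        return self.unembed(x)

# Training loop mirroring the reported setup: fresh batch each step,
# Adam with lr 1e-3 and no weight decay, next-token cross-entropy loss.
model = AttentionOnlyLM(vocab_size=10, d_model=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
for step in range(1000):
    batch = torch.from_numpy(generate_dataset(256, seq_len=128, num_states=10, seed=step))
    logits = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```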