Attention with Markov: A Curious Case of Single-layer Transformers

Authors: Ashok Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results.
Researcher Affiliation | Academia | Ashok Vardhan Makkuva (EPFL), Marco Bondaschi (EPFL), Adway Girish (EPFL), Alliot Nagle (UT Austin), Martin Jaggi (EPFL), Hyeji Kim (UT Austin), Michael Gastpar (EPFL)
Pseudocode | No | The paper provides mathematical descriptions of the transformer architecture but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code is available at https://github.com/Bond1995/Markov.
Open Datasets | No | The paper uses a k-th order binary Markov source generated for the experiments, as described in Section 3.1: 'we generate sequences {x_n}_{n=1}^N ∼ (π(p, q), P(p, q)) of length N = 1024'. This is a synthetic dataset generated by the authors; no pre-existing public dataset or access information for the raw generated data is provided.
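The sampling procedure described in the quoted passage can be sketched in a few lines; this is an illustrative reconstruction for the first-order (k = 1) binary case, not the authors' code (their implementation is in the linked repository). It assumes the standard parameterization where p is the probability of flipping from state 0 to 1, q the probability of flipping from 1 to 0, and the stationary distribution is π(p, q) = (q/(p+q), p/(p+q)).

```python
import random

def stationary_distribution(p, q):
    """Stationary distribution pi(p, q) of the binary chain with
    flip probabilities P(0 -> 1) = p and P(1 -> 0) = q."""
    return (q / (p + q), p / (p + q))

def sample_markov_sequence(p, q, n, rng=random):
    """Draw a length-n binary sequence: the initial state from
    pi(p, q), then each next state from the kernel P(p, q)."""
    pi0, _ = stationary_distribution(p, q)
    x = [0 if rng.random() < pi0 else 1]
    for _ in range(n - 1):
        prev = x[-1]
        flip = p if prev == 0 else q  # probability of switching state
        x.append(1 - prev if rng.random() < flip else prev)
    return x

# Example matching the reported sequence length N = 1024:
seq = sample_markov_sequence(p=0.2, q=0.3, n=1024)
```

Higher-order (k > 1) sources would condition the flip probability on the previous k symbols rather than just the last one.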
Dataset Splits | No | The paper describes generating synthetic Markov-chain sequences for training and refers to 'test loss' in figures, implying a testing phase, but it does not specify how the generated data is partitioned into training, validation, or test sets (e.g., percentages, sample counts, or splitting methodology).
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments.
Software Dependencies | No | The paper mentions using 'Optimizer AdamW (β1 = 0.9, β2 = 0.95)' and that the architecture is 'based on the GPT-2 architecture as implemented in Pagliardini (2023)'. While an optimizer and a base architecture are identified, specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are not listed.
Experiment Setup | Yes | Table 2 (settings and parameters for the transformer model used in the experiments): batch size grid-searched in {16, 50}; accumulation steps 1; optimizer AdamW (β1 = 0.9, β2 = 0.95); learning rate 0.001; cosine scheduler; 8000 iterations; weight decay 1 × 10⁻³; dropout 0; sequence length grid-searched in {512, 1024, 2048}; embedding dimension grid-searched in {4, 8, 16, 32, 64}; 1 to 6 transformer layers depending on the experiment; attention heads grid-searched in {1, 2, 4, 8}; mask window between 2 and full causal masking depending on the experiment; 3 or 5 repetitions.
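The cosine scheduler listed in Table 2 is presumably standard cosine annealing; a minimal stdlib sketch, assuming the rate decays from the listed peak of 0.001 to zero over the 8000 iterations (the final rate of zero is an assumption, not stated in the table):

```python
import math

def cosine_lr(step, total_steps=8000, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate at a given training step."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

cosine_lr(0)     # peak rate: 0.001
cosine_lr(8000)  # fully annealed: lr_min (here 0.0)
```

Frameworks such as PyTorch provide this schedule built in (e.g., `torch.optim.lr_scheduler.CosineAnnealingLR`), often combined with a short linear warmup.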