Attention with Markov: A Curious Case of Single-layer Transformers

Authors: Ashok Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results.
Researcher Affiliation | Academia | Ashok Vardhan Makkuva (EPFL), Marco Bondaschi (EPFL), Adway Girish (EPFL), Alliot Nagle (UT Austin), Martin Jaggi (EPFL), Hyeji Kim (UT Austin), Michael Gastpar (EPFL)
Pseudocode | No | The paper provides mathematical descriptions of the transformer architecture but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code is available at https://github.com/Bond1995/Markov.
Open Datasets | No | The paper uses a k-th order binary Markov source generated for the experiments, as described in Section 3.1: 'we generate sequences {x_n}_{n=1}^N ∼ (π(p, q), P(p, q)) of length N = 1024'. This is a synthetic dataset generated by the authors; no pre-existing public dataset or access information for the raw generated data is provided.
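The sampling procedure described in the quoted passage can be sketched in a few lines; this is an illustrative reconstruction for the first-order (k = 1) binary case, not the authors' code (their implementation is in the linked repository). It assumes the standard parameterization where p is the probability of flipping from state 0 to 1, q the probability of flipping from 1 to 0, and the stationary distribution is π(p, q) = (q/(p+q), p/(p+q)).

```python
import random

def stationary_distribution(p, q):
    """Stationary distribution pi(p, q) of the binary chain with
    flip probabilities P(0 -> 1) = p and P(1 -> 0) = q."""
    return (q / (p + q), p / (p + q))

def sample_markov_sequence(p, q, n, rng=random):
    """Draw a length-n binary sequence: the initial state from
    pi(p, q), then each next state from the kernel P(p, q)."""
    pi0, _ = stationary_distribution(p, q)
    x = [0 if rng.random() < pi0 else 1]
    for _ in range(n - 1):
        prev = x[-1]
        flip = p if prev == 0 else q  # probability of switching state
        x.append(1 - prev if rng.random() < flip else prev)
    return x

# Example matching the reported sequence length N = 1024:
seq = sample_markov_sequence(p=0.2, q=0.3, n=1024)
```

Higher-order (k > 1) sources would condition the flip probability on the previous k symbols rather than just the last one.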
Dataset Splits | No | The paper describes generating synthetic Markov-chain sequences for training and refers to 'test loss' in figures, implying a testing phase, but it does not specify how the generated data is partitioned into training, validation, or test sets (e.g., percentages, sample counts, or splitting methodology).
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments.
Software Dependencies | No | The paper mentions using 'Optimizer AdamW (β1 = 0.9, β2 = 0.95)' and that the architecture is 'based on the GPT-2 architecture as implemented in Pagliardini (2023)'. While an optimizer and a base architecture are identified, specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are not listed.
Experiment Setup | Yes | Table 2 (settings and parameters for the transformer model used in the experiments): batch size grid-searched in {16, 50}; accumulation steps 1; optimizer AdamW (β1 = 0.9, β2 = 0.95); learning rate 0.001; cosine scheduler; 8000 iterations; weight decay 1 × 10⁻³; dropout 0; sequence length grid-searched in {512, 1024, 2048}; embedding dimension grid-searched in {4, 8, 16, 32, 64}; 1 to 6 transformer layers depending on the experiment; attention heads grid-searched in {1, 2, 4, 8}; mask window between 2 and full causal masking depending on the experiment; 3 or 5 repetitions.
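The cosine scheduler listed in Table 2 is presumably standard cosine annealing; a minimal stdlib sketch, assuming the rate decays from the listed peak of 0.001 to zero over the 8000 iterations (the final rate of zero is an assumption, not stated in the table):

```python
import math

def cosine_lr(step, total_steps=8000, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate at a given training step."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

cosine_lr(0)     # peak rate: 0.001
cosine_lr(8000)  # fully annealed: lr_min (here 0.0)
```

Frameworks such as PyTorch provide this schedule built in (e.g., `torch.optim.lr_scheduler.CosineAnnealingLR`), often combined with a short linear warmup.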