Constrained Belief Updates Explain Geometric Structures in Transformer Representations

Authors: Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam Shai

ICML 2025

Reproducibility assessment — for each variable, the result and the supporting LLM response:
Research Type: Experimental
    "Our approach yields specific, testable predictions about attention patterns, value vectors, intermediate fractal representations, and final belief-state geometry. We confirm these predictions in trained transformers, demonstrating how the inherently recurrent next-token task is realized by an attention-based, parallelized implementation of Bayesian belief updates."
Researcher Affiliation: Collaboration
    "1 MATS, Berkeley, CA, USA; 2 Simplex, Astera Institute, Emeryville, CA, USA; 3 Beyond Institute for Theoretical Science (BITS), San Francisco, CA, USA. Correspondence to: Paul M. Riechers <EMAIL>, Adam S. Shai <EMAIL>."
Pseudocode: No
    The paper describes algorithms and processes using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code: No
    "Code for analysis can be found here."
Open Datasets: No
    "Our study focuses on the Mess3 parametrized family of hidden Markov models (Marzen & Crutchfield, 2017), which provide a tractable yet rich setting for studying sequence prediction. ... For each experimental run, we generate sequences by sampling from an HMM with specific (α, x) values."
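The data-generation step quoted above — sampling symbol sequences from a symbol-labeled HMM — can be sketched generically. The matrices below are illustrative stand-ins with the right shape, not the published Mess3 parametrization from Marzen & Crutchfield (2017), and the uniform initial state is a simplification (the paper would presumably start from the stationary distribution).

```python
import numpy as np

def sample_labeled_hmm(T, n, rng):
    """Sample n symbols from an HMM given as symbol-labeled transition
    matrices: T[a, i, j] = P(emit symbol a, move to state j | state i)."""
    num_symbols, num_states, _ = T.shape
    state = rng.integers(num_states)  # uniform start; a simplification
    symbols = []
    for _ in range(n):
        # Joint distribution over (symbol, next state) given the current state;
        # it sums to 1 because each T[:, i, :] slice is a probability table.
        joint = T[:, state, :].ravel()
        idx = rng.choice(joint.size, p=joint)
        a, state = divmod(idx, num_states)
        symbols.append(int(a))
    return symbols

# Illustrative 3-state, 3-symbol matrices -- NOT the real Mess3 (alpha, x)
# parametrization. In state i the machine emits symbol i with probability
# 1 - 2x (else one of the other two symbols with probability x each), then
# moves to the state named by the emitted symbol.
x = 0.15
T = np.zeros((3, 3, 3))
for i in range(3):
    for a in range(3):
        T[a, i, a] = 1 - 2 * x if a == i else x

rng = np.random.default_rng(0)
seq = sample_labeled_hmm(T, 20, rng)
```

A sweep over (α, x), as described in the experiment setup, would rebuild `T` from each parameter pair and draw fresh training sequences per run.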
Dataset Splits: No
    "We train single-layer transformers on next-token prediction using gradient descent, with sequences sampled from our parametrized HMMs as training data. ... We generate all possible input sequences up to length 10, recording hidden activations from the transformer's residual stream. These activations are organized into a dataset capturing the model's response to all input patterns." The paper does not specify any explicit training, validation, or test dataset splits.
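The exhaustive enumeration described in that quote is straightforward to sketch. The activation-capture step is left as a comment, since the paper does not specify the tooling (e.g., forward hooks) used to read the residual stream.

```python
import itertools

vocab = (0, 1, 2)   # Mess3 emits three symbols
max_len = 10

# Every possible input sequence of length 1 through 10, as described.
all_inputs = [seq
              for length in range(1, max_len + 1)
              for seq in itertools.product(vocab, repeat=length)]

# 3 + 3^2 + ... + 3^10 = 88,572 sequences. Each would be fed through the
# trained transformer and its residual-stream activations recorded; the
# capture mechanism itself is not specified in the paper.
```

Note that enumerating all prefixes makes a held-out split less meaningful here, which is consistent with the paper reporting no explicit train/validation/test splits.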
Hardware Specification: No
    The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources).
Software Dependencies: No
    "We use Adam optimizer (Kingma & Ba, 2017) with a 10^-4 learning rate and no weight decay." This mentions only an optimizer by name with a citation, not version numbers for any software dependencies.
Experiment Setup: Yes
    "We employ a standard single-layer transformer model with learned positional embeddings. The model architecture follows the conventional transformer design, with d_model = 64 and d_ff = 256. Depending on the Mess3 parameters, we use either a single-head or a double-head attention mechanism. We conduct a systematic sweep over the HMM parameters α and x, training a separate model for each pair. Models are trained on next-token prediction using cross-entropy loss, with batch size 128. We use Adam optimizer (Kingma & Ba, 2017) with a 10^-4 learning rate and no weight decay. Each model is trained for approximately 15 million tokens."
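The reported hyperparameters (d_model = 64, d_ff = 256, one or two heads, batch size 128, Adam at 10^-4 with no weight decay, cross-entropy next-token loss) pin down most of a training configuration. Below is a minimal sketch assuming PyTorch; the paper does not name its framework, and PyTorch's stock `TransformerEncoderLayer` is used here as a stand-in for the paper's block (dropout is unspecified, so it is disabled).

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    def __init__(self, vocab=3, d_model=64, d_ff=256, n_heads=1, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # learned positional embeddings
        # Stock encoder layer as a stand-in for the paper's architecture.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=0.0, batch_first=True)
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, x):
        t = x.shape[1]
        h = self.tok(x) + self.pos(torch.arange(t, device=x.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        return self.unembed(self.block(h, src_mask=causal))

model = OneLayerTransformer(n_heads=1)               # or n_heads=2, per the paper
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam defaults; no weight decay

# One training step on a stand-in batch (real batches come from the Mess3 HMM):
x = torch.randint(0, 3, (128, 10))
logits = model(x[:, :-1])                            # predict the next token
loss = nn.functional.cross_entropy(logits.reshape(-1, 3), x[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

At batch size 128 with sequences of roughly this length, the reported budget of ~15 million tokens corresponds to on the order of 10^4 such steps.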