Constrained Belief Updates Explain Geometric Structures in Transformer Representations

Authors: Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam Shai

ICML 2025

Reproducibility assessment — for each variable, the result and the supporting LLM response:
Research Type: Experimental
    "Our approach yields specific, testable predictions about attention patterns, value vectors, intermediate fractal representations, and final belief-state geometry. We confirm these predictions in trained transformers, demonstrating how the inherently recurrent next-token task is realized by an attention-based, parallelized implementation of Bayesian belief updates."
Researcher Affiliation: Collaboration
    "1 MATS, Berkeley, CA, USA; 2 Simplex, Astera Institute, Emeryville, CA, USA; 3 Beyond Institute for Theoretical Science (BITS), San Francisco, CA, USA. Correspondence to: Paul M. Riechers <EMAIL>, Adam S. Shai <EMAIL>."
Pseudocode: No
    The paper describes algorithms and processes using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code: No
    "Code for analysis can be found here."
Open Datasets: No
    "Our study focuses on the Mess3 parametrized family of hidden Markov models (Marzen & Crutchfield, 2017), which provide a tractable yet rich setting for studying sequence prediction. ... For each experimental run, we generate sequences by sampling from an HMM with specific (α, x) values."
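The data-generation step quoted above — sampling symbol sequences from a symbol-labeled HMM — can be sketched generically. The matrices below are illustrative stand-ins with the right shape, not the published Mess3 parametrization from Marzen & Crutchfield (2017), and the uniform initial state is a simplification (the paper would presumably start from the stationary distribution).

```python
import numpy as np

def sample_labeled_hmm(T, n, rng):
    """Sample n symbols from an HMM given as symbol-labeled transition
    matrices: T[a, i, j] = P(emit symbol a, move to state j | state i)."""
    num_symbols, num_states, _ = T.shape
    state = rng.integers(num_states)  # uniform start; a simplification
    symbols = []
    for _ in range(n):
        # Joint distribution over (symbol, next state) given the current state;
        # it sums to 1 because each T[:, i, :] slice is a probability table.
        joint = T[:, state, :].ravel()
        idx = rng.choice(joint.size, p=joint)
        a, state = divmod(idx, num_states)
        symbols.append(int(a))
    return symbols

# Illustrative 3-state, 3-symbol matrices -- NOT the real Mess3 (alpha, x)
# parametrization. In state i the machine emits symbol i with probability
# 1 - 2x (else one of the other two symbols with probability x each), then
# moves to the state named by the emitted symbol.
x = 0.15
T = np.zeros((3, 3, 3))
for i in range(3):
    for a in range(3):
        T[a, i, a] = 1 - 2 * x if a == i else x

rng = np.random.default_rng(0)
seq = sample_labeled_hmm(T, 20, rng)
```

A sweep over (α, x), as described in the experiment setup, would rebuild `T` from each parameter pair and draw fresh training sequences per run.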
Dataset Splits: No
    "We train single-layer transformers on next-token prediction using gradient descent, with sequences sampled from our parametrized HMMs as training data. ... We generate all possible input sequences up to length 10, recording hidden activations from the transformer's residual stream. These activations are organized into a dataset capturing the model's response to all input patterns." The paper does not specify any explicit training, validation, or test dataset splits.
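The exhaustive enumeration described in that quote is straightforward to sketch. The activation-capture step is left as a comment, since the paper does not specify the tooling (e.g., forward hooks) used to read the residual stream.

```python
import itertools

vocab = (0, 1, 2)   # Mess3 emits three symbols
max_len = 10

# Every possible input sequence of length 1 through 10, as described.
all_inputs = [seq
              for length in range(1, max_len + 1)
              for seq in itertools.product(vocab, repeat=length)]

# 3 + 3^2 + ... + 3^10 = 88,572 sequences. Each would be fed through the
# trained transformer and its residual-stream activations recorded; the
# capture mechanism itself is not specified in the paper.
```

Note that enumerating all prefixes makes a held-out split less meaningful here, which is consistent with the paper reporting no explicit train/validation/test splits.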
Hardware Specification: No
    The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources).
Software Dependencies: No
    "We use Adam optimizer (Kingma & Ba, 2017) with a 10^-4 learning rate and no weight decay." This mentions only an optimizer by name with a citation, not version numbers for any software dependencies.
Experiment Setup: Yes
    "We employ a standard single-layer transformer model with learned positional embeddings. The model architecture follows the conventional transformer design, with d_model = 64 and d_ff = 256. Depending on the Mess3 parameters, we use either a single-head or a double-head attention mechanism. We conduct a systematic sweep over the HMM parameters α and x, training a separate model for each pair. Models are trained on next-token prediction using cross-entropy loss, with batch size 128. We use Adam optimizer (Kingma & Ba, 2017) with a 10^-4 learning rate and no weight decay. Each model is trained for approximately 15 million tokens."
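The reported hyperparameters (d_model = 64, d_ff = 256, one or two heads, batch size 128, Adam at 10^-4 with no weight decay, cross-entropy next-token loss) pin down most of a training configuration. Below is a minimal sketch assuming PyTorch; the paper does not name its framework, and PyTorch's stock `TransformerEncoderLayer` is used here as a stand-in for the paper's block (dropout is unspecified, so it is disabled).

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    def __init__(self, vocab=3, d_model=64, d_ff=256, n_heads=1, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # learned positional embeddings
        # Stock encoder layer as a stand-in for the paper's architecture.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=0.0, batch_first=True)
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, x):
        t = x.shape[1]
        h = self.tok(x) + self.pos(torch.arange(t, device=x.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        return self.unembed(self.block(h, src_mask=causal))

model = OneLayerTransformer(n_heads=1)               # or n_heads=2, per the paper
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam defaults; no weight decay

# One training step on a stand-in batch (real batches come from the Mess3 HMM):
x = torch.randint(0, 3, (128, 10))
logits = model(x[:, :-1])                            # predict the next token
loss = nn.functional.cross_entropy(logits.reshape(-1, 3), x[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

At batch size 128 with sequences of roughly this length, the reported budget of ~15 million tokens corresponds to on the order of 10^4 such steps.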