Constrained Belief Updates Explain Geometric Structures in Transformer Representations
Authors: Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam Shai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach yields specific, testable predictions about attention patterns, value vectors, intermediate fractal representations, and final belief-state geometry. We confirm these predictions in trained transformers, demonstrating how the inherently recurrent next-token task is realized by an attention-based, parallelized implementation of Bayesian belief updates. |
| Researcher Affiliation | Collaboration | 1MATS, Berkeley, CA, USA 2Simplex, Astera Institute, Emeryville, CA, USA 3Beyond Institute for Theoretical Science (BITS), San Francisco, CA, USA. Correspondence to: Paul M. Riechers <EMAIL>, Adam S. Shai <EMAIL>. |
| Pseudocode | No | The paper describes algorithms and processes using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | No | The paper states "Code for analysis can be found here," but no repository URL or code link is given in the extracted text. |
| Open Datasets | No | Our study focuses on the Mess3 parametrized family of hidden Markov models (Marzen & Crutchfield, 2017), which provide a tractable yet rich setting for studying sequence prediction. ... For each experimental run, we generate sequences by sampling from an HMM with specific (α, x) values. |
| Dataset Splits | No | We train single-layer transformers on next-token prediction using gradient descent, with sequences sampled from our parametrized HMMs as training data. ... We generate all possible input sequences up to length 10, recording hidden activations from the transformer’s residual stream. These activations are organized into a dataset capturing the model’s response to all input patterns. The paper does not specify any explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources). |
| Software Dependencies | No | We use Adam optimizer (Kingma & Ba, 2017) with a 10^-4 learning rate and no weight decay. The paper only names the optimizer with a citation; it does not give version numbers for any software dependencies. |
| Experiment Setup | Yes | We employ a standard single-layer transformer model with learned positional embeddings. The model architecture follows the conventional transformer design, with dmodel = 64 and dff = 256. Depending on the Mess3 parameters, we use either a single-head or a double-head attention mechanism. We conduct a systematic sweep over the HMM parameters α and x, training a separate model for each pair. Models are trained on next-token prediction using cross-entropy loss, with batch size 128. We use Adam optimizer (Kingma & Ba, 2017) with a 10^-4 learning rate and no weight decay. Each model is trained for approximately 15 million tokens. |
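The Bayesian belief updates that the paper argues the transformer implements can be illustrated for a generic labeled-transition HMM. This is a minimal sketch: the transition tensor below is a random placeholder, not the actual Mess3 parametrization in (α, x) from Marzen & Crutchfield (2017); only the standard mixed-state update rule η′ = ηT⁽ˢ⁾ / (ηT⁽ˢ⁾𝟙) is as described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 3-state, 3-symbol labeled transition tensor T[s, i, j]
# (probability of emitting symbol s and moving i -> j). NOT the real
# Mess3 matrices; normalized so sum over (s, j) is 1 for each state i.
raw = rng.random((3, 3, 3))
T = raw / raw.sum(axis=(0, 2), keepdims=True)


def belief_update(eta, T_s):
    """One Bayesian belief update on observing symbol s: eta' ∝ eta @ T^(s)."""
    unnorm = eta @ T_s
    return unnorm / unnorm.sum()


def sample_sequence(T, length, rng):
    """Sample a symbol sequence from the labeled-transition HMM."""
    n_sym, n_states, _ = T.shape
    state = rng.integers(n_states)  # uniform start state for illustration
    symbols = []
    for _ in range(length):
        probs = T[:, state, :].ravel()          # joint over (symbol, next state)
        idx = rng.choice(probs.size, p=probs / probs.sum())
        s, state = divmod(idx, n_states)        # recover (symbol, next state)
        symbols.append(s)
    return symbols


# Track the belief state (a point on the probability simplex) along a
# length-10 sequence, mirroring the paper's length-10 input analysis.
eta = np.full(3, 1 / 3)  # uniform prior over hidden states
for s in sample_sequence(T, 10, rng):
    eta = belief_update(eta, T[s])
```

The sequence of η vectors visited this way traces out the fractal belief-state geometry that the paper compares against residual-stream activations.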
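The reported experiment setup can be collected into a single config sketch. The hyperparameters below are those quoted in the table (d_model = 64, d_ff = 256, single layer, 1 or 2 heads, batch size 128, Adam at 10^-4, no weight decay, ~15M training tokens); the context length of 10 is an assumption inferred from the paper's length-10 input sequences, and the step-count helper is our own illustrative arithmetic, not from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Architecture, as stated in the paper
    d_model: int = 64
    d_ff: int = 256
    n_layers: int = 1
    n_heads: int = 1           # 1 or 2 depending on the Mess3 (alpha, x) pair

    # Optimization, as stated in the paper
    batch_size: int = 128
    lr: float = 1e-4
    weight_decay: float = 0.0
    token_budget: int = 15_000_000  # "approximately 15 million tokens"

    # Assumption: context length matches the length-10 analysis sequences
    seq_len: int = 10

    def n_steps(self) -> int:
        """Gradient steps needed to consume the full token budget."""
        return math.ceil(self.token_budget / (self.batch_size * self.seq_len))
```

Under these assumptions, one run is roughly `TrainConfig().n_steps()` ≈ 11.7k steps per (α, x) pair in the sweep.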