(How) Do Language Models Track State?
Authors: Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). ... We show that LMs consistently learn one of two state tracking mechanisms for this task. ... Finally, Sections 4 and 5 present experimental findings. Across a range of sizes, architectures, and pretraining schemes, we find that LMs consistently learn one of two state tracking mechanisms. |
| Researcher Affiliation | Academia | MIT EECS and CSAIL. Correspondence to: Belinda Z. Li <EMAIL>. |
| Pseudocode | Yes | 3.1. Sequential Algorithm: h_{t,0} = a_t ∀t // initialize actions (h_{0,0} = s_0) // by definition; see §2.2. for t = 1..T, l = 1..L do: if l < t then h_{t,l} = h_{t,l-1} = a_t // propagate actions; if l = t then h_{t,l} = h_{t-1,l-1} h_{t,l-1} = s_{t-1} a_t = s_t // update states; if l > t then h_{t,l} = h_{t,l-1} = s_t // propagate states; end for |
| Open Source Code | Yes | Code and data are available at https://github.com/belindal/state-tracking. |
| Open Datasets | Yes | We generate 1 million unique length-100 sequences of permutations in both S3 and S5. ... Except where noted, we begin with Pythia-160M models pre-trained on the Pile dataset (Biderman et al., 2023). |
| Dataset Splits | Yes | We generate 1 million unique length-100 sequences of permutations in both S3 and S5. We split the data 90/10 for training/analysis, and fine-tune these models (using a cross-entropy loss) to predict the state corresponding to each prefix of each action sequence. |
| Hardware Specification | No | For larger models (above 700M parameters), we train using bfloat16. No specific hardware models (e.g., GPU or CPU names) are provided. |
| Software Dependencies | No | The paper mentions the AdamW optimizer, Pythia-160M models pre-trained on the Pile dataset, and GPT-2, but does not provide version numbers for underlying software libraries such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Regardless of initialization scheme, we fine-tune models for 20 epochs on Equation (3) using the AdamW optimizer with learning rate 5e-5 and batch size 128. |
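The sequential algorithm quoted in the Pseudocode row can be sketched as plain Python. This is an illustrative simulation, not the paper's code: representing permutations as index tuples and the function names `compose` / `sequential_states` are assumptions made here.

```python
def compose(p, q):
    """Compose two permutations given as index tuples (an illustrative
    representation): result[i] = p[q[i]]."""
    return tuple(p[i] for i in q)

def sequential_states(s0, actions):
    """Layer-by-layer simulation of the quoted sequential mechanism:
    position t only resolves its state s_t at layer l = t."""
    T = len(actions)
    # Layer 0: position 0 holds the initial state s0; position t holds
    # the raw action a_t.
    h = [s0] + list(actions)
    for l in range(1, T + 1):
        new = list(h)  # l < t and l > t cases just copy (propagate)
        for t in range(1, T + 1):
            if l == t:
                # Update: h_{t,l} = h_{t-1,l-1} h_{t,l-1} = s_{t-1} a_t = s_t
                new[t] = compose(h[t - 1], h[t])
        h = new
    return h  # h[t] now equals the state s_t for every prefix
```

Note that this mechanism needs as many layers as sequence positions before the last state resolves, which matches the paper's framing of it as the "sequential" (as opposed to parallel) strategy.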
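The data protocol described in the Open Datasets and Dataset Splits rows (unique permutation sequences with per-prefix state labels, split 90/10) can be sketched as follows. The helper name, dict layout, and rejection-sampling loop are assumptions; defaults are scaled down from the paper's 1 million length-100 sequences.

```python
import itertools
import random

def make_permutation_dataset(n=3, n_seqs=100, seq_len=10, seed=0):
    """Sample unique action sequences over S_n with per-prefix state
    labels, then split 90/10 for training/analysis (sketch of the
    paper's setup; S3 here, S5 via n=5)."""
    rng = random.Random(seed)
    perms = list(itertools.permutations(range(n)))
    identity = tuple(range(n))
    seen, data = set(), []
    while len(data) < n_seqs:
        actions = tuple(rng.choice(perms) for _ in range(seq_len))
        if actions in seen:  # enforce uniqueness of sequences
            continue
        seen.add(actions)
        states, s = [], identity
        for a in actions:
            s = tuple(s[i] for i in a)  # s_t = s_{t-1} . a_t
            states.append(s)
        data.append({"actions": actions, "states": states})
    rng.shuffle(data)
    cut = int(0.9 * len(data))
    return data[:cut], data[cut:]  # 90/10 train/analysis split
```

Each example pairs an action sequence with the state after every prefix, which is exactly the supervision the quoted fine-tuning objective requires.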