Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

TMLR 2025

Reproducibility checklist (Variable: Result, with the supporting LLM response quoted below each entry)
Research Type: Experimental
    "we demonstrate that autoregressive language models such as GPT can accurately learn and reason over these CFG-defined hierarchical languages and generate valid continuations. Analyzing model internals in this controlled setting, we reveal that hidden states linearly encode CFG parse structure, and that attention patterns align closely with the information flow of dynamic-programming parsing algorithms. ... We test the model's accuracy and diversity by feeding it prefixes from the CFG (or no prefix, just the starting token) and observing if it can generate completions."
Researcher Affiliation: Collaboration
    Zeyuan Allen-Zhu (EMAIL), FAIR at Meta; Yuanzhi Li (EMAIL), Mohamed bin Zayed University of AI
Pseudocode: Yes
    "V5 added pseudocode and further clarified the connection to dynamic programming. ... Algorithm 1: the two-step DP to compute the next-token conditional probability"
Open Source Code: Yes
    "Our data generators and evaluation tools, including accuracy evaluation and ground-truth distribution computation, are open-sourced as part of the package (Allen-Zhu, 2025b). ... Code released at https://github.com/facebookresearch/PhysicsLM4."
Open Datasets: Yes
    "We construct seven synthetic CFGs of depth L = 7, detailed in Section A.1. ... We derive the English CFG from the Penn Treebank (PTB) dataset (Marcus et al., 1993)."
Dataset Splits: No
    "Throughout the experiments, for both pre-training and testing, we only use fresh samples from the CFG datasets (thus using 4.9 billion tokens = 96 × 512 × 100k)."
Hardware Specification: Yes
    "We test our results using a mixture of V100 and A100 GPUs (on A100, pretraining a model takes less than a day using 4 GPUs), even when using float32."
Software Dependencies: No
    The paper mentions the 'huggingface library', the 'RoPE implementation from the GPT-NeoX-20B project', and the 'relative attention framework from DeBERTa', but does not specify version numbers for any of these software components. It also mentions 'AdamW', which is an optimizer, not a versioned software library.
Experiment Setup: Yes
    "For GPT pre-training, we use AdamW with β = (0.9, 0.98), weight decay 0.1, learning rate 0.0003, and batch size 96. We pre-train the model for 100k iterations, with a linear learning rate decay. ... For our probing method ... We again use AdamW with β = (0.9, 0.98) but this time with learning rate 0.003, weight decay 0.001, batch size 60 and train for 30k iterations."
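The "two-step DP" mentioned in the Pseudocode entry builds on classic chart parsing. As a hedged illustration (this is not the paper's Algorithm 1), the sketch below shows the kind of dynamic program involved: a CYK recognizer over a toy grammar in Chomsky normal form, deciding whether a token sequence is derivable from the start symbol, which is the same kind of check used to grade generated completions for validity. The grammar and symbols are invented for illustration.

```python
# Toy CNF grammar (invented): binary rules LHS -> (A, B) and
# terminal rules LHS -> token. The language here is a+b+.
BINARY = {"S": [("A", "B")], "A": [("A", "A")], "B": [("B", "B")]}
TERMINAL = {"A": ["a"], "B": ["b"]}

def cyk_accepts(tokens, start="S"):
    n = len(tokens)
    # table[i][l] = set of nonterminals deriving tokens[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):            # length-1 spans
        for lhs, terms in TERMINAL.items():
            if tok in terms:
                table[i][1].add(lhs)
    for length in range(2, n + 1):              # longer spans, bottom-up
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, rules in BINARY.items():
                    for a, b in rules:
                        if (a in table[i][split]
                                and b in table[i + split][length - split]):
                            table[i][length].add(lhs)
    return start in table[0][n]
```

For example, `cyk_accepts(list("aabb"))` is accepted (A derives the run of a's, B the run of b's), while `cyk_accepts(list("ba"))` is rejected.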
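The Open Datasets entry refers to the released synthetic-CFG data generators. As a rough sketch of what such a generator does (the actual depth-7 grammars live in the released package; the toy grammar below is invented), one can sample strings by recursively expanding nonterminals:

```python
import random

# Invented toy grammar: nonterminals map to candidate right-hand sides,
# chosen uniformly at random; lowercase symbols are terminal tokens.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb"], ["verb", "NP"]],
}

def sample(symbol="S", rng=random):
    if symbol not in RULES:          # terminal: emit as a token
        return [symbol]
    rhs = rng.choice(RULES[symbol])  # pick one production uniformly
    out = []
    for child in rhs:
        out.extend(sample(child, rng))
    return out
```

Every call to `sample()` yields a fresh token sequence from the grammar, matching the paper's setup of training only on fresh CFG samples rather than a fixed split.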
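The token count quoted in the Dataset Splits entry checks out arithmetically, assuming the three factors are batch size 96, sequence length 512, and 100k training iterations (the 512 is presumably the context length):

```python
# Sanity check: batch size x sequence length x iterations
# (512 as sequence length is an assumption from context)
batch_size, seq_len, iterations = 96, 512, 100_000
total_tokens = batch_size * seq_len * iterations
print(total_tokens)  # 4915200000, i.e. ~4.9 billion tokens
```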
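The Experiment Setup entry can be collected into configuration form. The sketch below restates the reported hyperparameters and the linear learning-rate decay; decaying to exactly zero is an assumption, since the paper says "linear decay" without stating the floor:

```python
# Reported hyperparameters, restated as config dicts.
PRETRAIN = dict(optimizer="AdamW", betas=(0.9, 0.98), weight_decay=0.1,
                lr=3e-4, batch_size=96, iterations=100_000)
PROBING = dict(optimizer="AdamW", betas=(0.9, 0.98), weight_decay=0.001,
               lr=3e-3, batch_size=60, iterations=30_000)

def linear_decay_lr(step, base_lr, total_steps):
    """Learning rate at `step` under linear decay to zero
    (the zero floor is an assumption, not stated in the paper)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

For instance, pre-training starts at 3e-4 and reaches half that value at iteration 50k under this schedule.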