Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

TMLR 2025

Reproducibility checklist (Variable: Result, with the supporting LLM response quoted below each entry)
Research Type: Experimental
    "we demonstrate that autoregressive language models such as GPT can accurately learn and reason over these CFG-defined hierarchical languages and generate valid continuations. Analyzing model internals in this controlled setting, we reveal that hidden states linearly encode CFG parse structure, and that attention patterns align closely with the information flow of dynamic-programming parsing algorithms. ... We test the model's accuracy and diversity by feeding it prefixes from the CFG (or no prefix, just the starting token) and observing if it can generate completions."
Researcher Affiliation: Collaboration
    Zeyuan Allen-Zhu (EMAIL), FAIR at Meta; Yuanzhi Li (EMAIL), Mohamed bin Zayed University of AI
Pseudocode: Yes
    "V5 added pseudocode and further clarified the connection to dynamic programming. ... Algorithm 1: the two-step DP to compute the next-token conditional probability"
Open Source Code: Yes
    "Our data generators and evaluation tools, including accuracy evaluation and ground-truth distribution computation, are open-sourced as part of the package (Allen-Zhu, 2025b). ... Code released at https://github.com/facebookresearch/PhysicsLM4."
Open Datasets: Yes
    "We construct seven synthetic CFGs of depth L = 7, detailed in Section A.1. ... We derive the English CFG from the Penn Treebank (PTB) dataset (Marcus et al., 1993)."
Dataset Splits: No
    "Throughout the experiments, for both pre-training and testing, we only use fresh samples from the CFG datasets (thus using 4.9 billion tokens = 96 × 512 × 100k)."
Hardware Specification: Yes
    "We test our results using a mixture of V100 and A100 GPUs (on A100, pretraining a model takes less than a day using 4 GPUs), even when using float32."
Software Dependencies: No
    The paper mentions the 'huggingface library', the 'RoPE implementation from the GPT-NeoX-20B project', and the 'relative attention framework from DeBERTa', but does not specify version numbers for any of these software components. It also mentions 'AdamW', which is an optimizer, not a versioned software library.
Experiment Setup: Yes
    "For GPT pre-training, we use AdamW with β = (0.9, 0.98), weight decay 0.1, learning rate 0.0003, and batch size 96. We pre-train the model for 100k iterations, with a linear learning rate decay. ... For our probing method ... We again use AdamW with β = (0.9, 0.98) but this time with learning rate 0.003, weight decay 0.001, batch size 60 and train for 30k iterations."
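The "two-step DP" mentioned in the Pseudocode entry builds on classic chart parsing. As a hedged illustration (this is not the paper's Algorithm 1), the sketch below shows the kind of dynamic program involved: a CYK recognizer over a toy grammar in Chomsky normal form, deciding whether a token sequence is derivable from the start symbol, which is the same kind of check used to grade generated completions for validity. The grammar and symbols are invented for illustration.

```python
# Toy CNF grammar (invented): binary rules LHS -> (A, B) and
# terminal rules LHS -> token. The language here is a+b+.
BINARY = {"S": [("A", "B")], "A": [("A", "A")], "B": [("B", "B")]}
TERMINAL = {"A": ["a"], "B": ["b"]}

def cyk_accepts(tokens, start="S"):
    n = len(tokens)
    # table[i][l] = set of nonterminals deriving tokens[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):            # length-1 spans
        for lhs, terms in TERMINAL.items():
            if tok in terms:
                table[i][1].add(lhs)
    for length in range(2, n + 1):              # longer spans, bottom-up
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, rules in BINARY.items():
                    for a, b in rules:
                        if (a in table[i][split]
                                and b in table[i + split][length - split]):
                            table[i][length].add(lhs)
    return start in table[0][n]
```

For example, `cyk_accepts(list("aabb"))` is accepted (A derives the run of a's, B the run of b's), while `cyk_accepts(list("ba"))` is rejected.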
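The Open Datasets entry refers to the released synthetic-CFG data generators. As a rough sketch of what such a generator does (the actual depth-7 grammars live in the released package; the toy grammar below is invented), one can sample strings by recursively expanding nonterminals:

```python
import random

# Invented toy grammar: nonterminals map to candidate right-hand sides,
# chosen uniformly at random; lowercase symbols are terminal tokens.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb"], ["verb", "NP"]],
}

def sample(symbol="S", rng=random):
    if symbol not in RULES:          # terminal: emit as a token
        return [symbol]
    rhs = rng.choice(RULES[symbol])  # pick one production uniformly
    out = []
    for child in rhs:
        out.extend(sample(child, rng))
    return out
```

Every call to `sample()` yields a fresh token sequence from the grammar, matching the paper's setup of training only on fresh CFG samples rather than a fixed split.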
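The token count quoted in the Dataset Splits entry checks out arithmetically, assuming the three factors are batch size 96, sequence length 512, and 100k training iterations (the 512 is presumably the context length):

```python
# Sanity check: batch size x sequence length x iterations
# (512 as sequence length is an assumption from context)
batch_size, seq_len, iterations = 96, 512, 100_000
total_tokens = batch_size * seq_len * iterations
print(total_tokens)  # 4915200000, i.e. ~4.9 billion tokens
```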
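The Experiment Setup entry can be collected into configuration form. The sketch below restates the reported hyperparameters and the linear learning-rate decay; decaying to exactly zero is an assumption, since the paper says "linear decay" without stating the floor:

```python
# Reported hyperparameters, restated as config dicts.
PRETRAIN = dict(optimizer="AdamW", betas=(0.9, 0.98), weight_decay=0.1,
                lr=3e-4, batch_size=96, iterations=100_000)
PROBING = dict(optimizer="AdamW", betas=(0.9, 0.98), weight_decay=0.001,
               lr=3e-3, batch_size=60, iterations=30_000)

def linear_decay_lr(step, base_lr, total_steps):
    """Learning rate at `step` under linear decay to zero
    (the zero floor is an assumption, not stated in the paper)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

For instance, pre-training starts at 3e-4 and reaches half that value at iteration 50k under this schedule.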