Physics of Language Models: Part 3.2, Knowledge Manipulation
Authors: Zeyuan Allen-Zhu, Yuanzhi Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our primary contribution is a controlled, synthetic experiment that confirms these weaknesses are inherent to language models: they cannot efficiently manipulate knowledge from pre-training data, even when such knowledge is perfectly stored in the models, despite adequate training and sufficient model size. Our findings also apply to modern pretrained language models such as GPT-4/4o, thus giving rise to many Turing tests to distinguish humans from contemporary AIs. |
| Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (FAIR at Meta); Yuanzhi Li (Mohamed bin Zayed University of AI) |
| Pseudocode | No | The paper describes methodologies in paragraph form and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Full and future editions of Part 3.2, including additional experiments and potential code releases, can be found at physics.allen-zhu.com and ssrn.com/abstract=5250621. |
| Open Datasets | No | To address the unpredictability of internet data, Allen-Zhu & Li (2024; 2025) developed synthetic pretrain data containing controlled biographies for up to N = 20 million individuals. |
| Dataset Splits | No | During training, the model learns from the biographies of all N individuals and the knowledge manipulation question-answer (QA) texts from a subset of individuals (the in-distribution set Ptrain). We evaluate the model's out-of-distribution (OOD) generation accuracy by testing it on the remaining subset (the out-of-distribution set Ptest), where it has seen the biographies but not the QAs during training. |
| Hardware Specification | No | The paper mentions models like GPT-4/4o and Llama-3.1-405B but does not specify the hardware (e.g., GPU models, CPU types) used for running experiments or evaluating these models. |
| Software Dependencies | No | The paper references various language models (e.g., GPT-4, Llama-3.1-405B, Gemini 2.0, Claude 3.5/3.7) but does not provide specific version numbers for software libraries, frameworks, or operating systems used in their experimental setup. |
| Experiment Setup | No | Technical details are omitted in this ICLR version to encourage readers to refer to our full paper at ssrn.com/abstract=5250621, which will also feature up-to-date experiments on this topic. |
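The Dataset Splits row describes an individual-level split: biographies for all N individuals appear in pretraining, but QA pairs are included only for an in-distribution subset Ptrain, with OOD accuracy measured on the held-out Ptest. A minimal sketch of such a split is below; the function name, the 50/50 split fraction, and the seed are illustrative assumptions, not details from the paper.

```python
import random

def split_individuals(n_individuals, train_fraction=0.5, seed=0):
    """Split individual IDs into Ptrain (QA pairs seen during training)
    and Ptest (biographies seen, QA pairs held out for OOD evaluation).

    train_fraction and seed are hypothetical; the paper does not state them.
    """
    rng = random.Random(seed)
    ids = list(range(n_individuals))
    rng.shuffle(ids)
    cut = int(n_individuals * train_fraction)
    # All n_individuals biographies go into pretraining regardless of split;
    # only individuals in p_train contribute QA texts to the training data.
    p_train, p_test = set(ids[:cut]), set(ids[cut:])
    return p_train, p_test

# Illustrative usage on a small population (the paper scales to N = 20 million).
p_train, p_test = split_individuals(10_000)
assert p_train.isdisjoint(p_test)
assert len(p_train) + len(p_test) == 10_000
```

Splitting by individual rather than by QA pair is what makes the evaluation out-of-distribution: the model has stored the relevant facts for Ptest individuals but has never seen a manipulation question about them.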