Physics of Language Models: Part 3.2, Knowledge Manipulation
Authors: Zeyuan Allen-Zhu, Yuanzhi Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our primary contribution is a controlled, synthetic experiment that confirms these weaknesses are inherent to language models: they cannot efficiently manipulate knowledge from pre-training data, even when such knowledge is perfectly stored in the models, despite adequate training and sufficient model size. Our findings also apply to modern pretrained language models such as GPT-4/4o, thus giving rise to many Turing tests to distinguish humans from contemporary AIs. |
| Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (FAIR at Meta); Yuanzhi Li (Mohamed bin Zayed University of AI) |
| Pseudocode | No | The paper describes methodologies in paragraph form and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Full and future editions of Part 3.2, including additional experiments and potential code releases, can be found at physics.allen-zhu.com and ssrn.com/abstract=5250621. |
| Open Datasets | No | To address the unpredictability of internet data, Allen-Zhu & Li (2024; 2025) developed synthetic pretrain data containing controlled biographies for up to N = 20 million individuals. |
| Dataset Splits | No | During training, the model learns from the biographies of all N individuals and the knowledge manipulation question-answer (QA) texts from a subset of individuals (the in-distribution set Ptrain). We evaluate the model's out-of-distribution (OOD) generation accuracy by testing it on the remaining subset (the out-of-distribution set Ptest), where it has seen the biographies but not the QAs during training. |
| Hardware Specification | No | The paper mentions models like GPT-4/4o and Llama-3.1-405B but does not specify the hardware (e.g., GPU models, CPU types) used for running experiments or evaluating these models. |
| Software Dependencies | No | The paper references various language models (e.g., GPT-4, Llama-3.1-405B, Gemini 2.0, Claude 3.5/3.7) but does not provide specific version numbers for software libraries, frameworks, or operating systems used in their experimental setup. |
| Experiment Setup | No | Technical details are omitted in this ICLR version to encourage readers to refer to our full paper at ssrn.com/abstract=5250621, which will also feature up-to-date experiments on this topic. |
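The Dataset Splits row describes an individual-level split: biographies for all N individuals appear in pretraining, but QA pairs are included only for an in-distribution subset Ptrain, with OOD accuracy measured on the held-out Ptest. A minimal sketch of such a split is below; the function name, the 50/50 split fraction, and the seed are illustrative assumptions, not details from the paper.

```python
import random

def split_individuals(n_individuals, train_fraction=0.5, seed=0):
    """Split individual IDs into Ptrain (QA pairs seen during training)
    and Ptest (biographies seen, QA pairs held out for OOD evaluation).

    train_fraction and seed are hypothetical; the paper does not state them.
    """
    rng = random.Random(seed)
    ids = list(range(n_individuals))
    rng.shuffle(ids)
    cut = int(n_individuals * train_fraction)
    # All n_individuals biographies go into pretraining regardless of split;
    # only individuals in p_train contribute QA texts to the training data.
    p_train, p_test = set(ids[:cut]), set(ids[cut:])
    return p_train, p_test

# Illustrative usage on a small population (the paper scales to N = 20 million).
p_train, p_test = split_individuals(10_000)
assert p_train.isdisjoint(p_test)
assert len(p_train) + len(p_test) == 10_000
```

Splitting by individual rather than by QA pair is what makes the evaluation out-of-distribution: the model has stored the relevant facts for Ptest individuals but has never seen a manipulation question about them.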