Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

ICLR 2025

Reproducibility Assessment: Variable, Result, LLM Response
Research Type: Experimental. "Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate information-theoretically the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications."
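The 2-bit-per-parameter claim implies a simple back-of-the-envelope capacity bound. A minimal sketch, with hypothetical figures (the 124M parameter count and the ~50 bits per tuple are illustrative assumptions, not values quoted from the paper):

```python
def max_knowledge_tuples(num_params: int, bits_per_tuple: float) -> int:
    # Capacity law from the paper: roughly 2 bits of knowledge per parameter.
    capacity_bits = 2 * num_params
    # Upper bound on how many (name, attribute, value) tuples fit in that budget.
    return int(capacity_bits // bits_per_tuple)

# Hypothetical example: a 124M-parameter model, ~50 bits of entropy per tuple.
print(max_knowledge_tuples(124_000_000, 50))
```

Under these assumed numbers the bound works out to 4,960,000 tuples; the paper's actual estimates come from the information-theoretic entropy of its synthetic knowledge bases.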
Researcher Affiliation: Collaboration. Zeyuan Allen-Zhu, FAIR at Meta, EMAIL; Yuanzhi Li, Mohamed bin Zayed University of AI, EMAIL.
Pseudocode: No. The paper defines theoretical concepts and presents theorems (e.g., Theorem 3.1) but does not include any structured pseudocode or algorithm blocks. It explicitly states, "We omit all technical details in this ICLR 2025 camera-ready version."
Open Source Code: No. "Full and future editions of Part 3.3, including additional experiments and potential code releases, are available at physics.allen-zhu.com and ssrn.com/abstract=5250617."
Open Datasets: Yes. "We generate synthetic knowledge-only datasets by uniformly at random generating (name, attribute, value) tuples from a knowledge base and converting them into English descriptions. We pretrain language models (e.g., GPT-2, LLaMA, Mistral) on these texts using a standard auto-regressive objective from random initialization, and estimate the learned knowledge. By varying the number of knowledge pieces and model sizes, we outline a knowledge capacity scaling law. ... Allen-Zhu & Li (2024) introduced a synthetic biography dataset comprising N randomly-generated (fake) individuals, each characterized by six attributes: birth date, birth city, university, major, employer, and working city."
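The described generation process (uniform sampling of tuples, then templated English) can be sketched in a few lines. This is an assumption-laden toy, not the authors' generator: the names, attribute pools, and sentence template below are hypothetical, and the real bioS datasets draw from far larger value spaces.

```python
import random

ATTR_POOLS = {
    # Hypothetical, tiny value pools; the real datasets use much larger spaces
    # and six attributes per individual.
    "birth city": ["Princeton", "Boston", "Seattle"],
    "major": ["Physics", "Biology", "Music"],
    "employer": ["Meta", "MBZUAI", "NASA"],
}

def make_tuples(n_people: int, seed: int = 0):
    """Uniformly sample (name, attribute, value) tuples for fake individuals."""
    rng = random.Random(seed)
    return [
        (f"Person_{i}", attr, rng.choice(pool))  # placeholder names
        for i in range(n_people)
        for attr, pool in ATTR_POOLS.items()
    ]

def to_english(t):
    """Convert one tuple into a plain English description."""
    name, attr, value = t
    return f"{name}'s {attr} is {value}."

bios = [to_english(t) for t in make_tuples(2)]
```

Each person yields one sentence per attribute; the resulting text is what the models are pretrained on from random initialization.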
Dataset Splits: No. The paper describes how training data is prepared: "Knowledge paragraphs about individuals are randomly concatenated, separated by <EOS> tokens, and then randomly segmented into 512-token windows." However, it does not specify explicit training, validation, or test dataset splits in terms of percentages or sample counts for evaluation.
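The quoted concatenate-and-segment step can be sketched as follows. A minimal sketch under stated assumptions: the `eos_id` default and the shuffle-then-chunk details are illustrative choices, not taken from the paper.

```python
import random

def pack_windows(docs, eos_id=50256, window=512, seed=0):
    """Randomly concatenate tokenized paragraphs, separated by <EOS>,
    then segment the stream into fixed-length token windows (sketch)."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)                      # random concatenation order
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)              # <EOS> separator after each paragraph
    # Non-overlapping windows; a trailing remainder shorter than `window` is dropped.
    return [stream[i:i + window] for i in range(0, len(stream) - window + 1, window)]

# Toy demonstration with a 3-token window instead of 512.
windows = pack_windows([[1, 2], [3, 4]], eos_id=0, window=3)
```

With two 2-token paragraphs plus their separators, the 6-token stream splits into exactly two windows of 3.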
Hardware Specification: Yes. "Training GPT2-20-16 on bioS(10M) for 1000 exposures costs 8.5 days with 64 A100s, while GPT2-12-32 on bioS(20M) for 100 exposures took 2.4 days."
Software Dependencies: No. The paper mentions using "the default AdamW optimizer and mixed-precision fp16" and "The default GPT2Tokenizer is used". It does not provide specific version numbers for software libraries or frameworks such as PyTorch, TensorFlow, or Python.
Experiment Setup: Yes. "We train language models from scratch (i.e., random initialization) using the specified datasets. Knowledge paragraphs about individuals are randomly concatenated, separated by <EOS> tokens, and then randomly segmented into 512-token windows. The standard autoregressive loss is employed for training. Unless specified otherwise, training utilizes the default AdamW optimizer and mixed-precision fp16. Learning rates and weight decays are moderately tuned (see full paper)."
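The "standard autoregressive loss" referenced above is the mean next-token negative log-likelihood. A minimal pure-Python sketch (the toy vocabulary and probabilities are illustrative, not from the paper):

```python
import math

def autoregressive_loss(pred_dists, target_ids):
    """Mean next-token negative log-likelihood, the standard LM pretraining loss.
    pred_dists[t] is the model's predicted probability distribution over the
    vocabulary at step t; target_ids[t] is the true next token."""
    nll = [-math.log(pred_dists[t][target_ids[t]]) for t in range(len(target_ids))]
    return sum(nll) / len(nll)

# Toy vocabulary of 4 tokens, two prediction steps.
dists = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
loss = autoregressive_loss(dists, [0, 2])
```

In practice this loss is computed over the 512-token windows described above, with frameworks handling the log-softmax and batching.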