Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

ICLR 2025

Reproducibility Assessment: Variable, Result, LLM Response
Research Type: Experimental. "Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate information-theoretically the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications."
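The 2-bit-per-parameter claim implies a simple back-of-the-envelope capacity bound. A minimal sketch, with hypothetical figures (the 124M parameter count and the ~50 bits per tuple are illustrative assumptions, not values quoted from the paper):

```python
def max_knowledge_tuples(num_params: int, bits_per_tuple: float) -> int:
    # Capacity law from the paper: roughly 2 bits of knowledge per parameter.
    capacity_bits = 2 * num_params
    # Upper bound on how many (name, attribute, value) tuples fit in that budget.
    return int(capacity_bits // bits_per_tuple)

# Hypothetical example: a 124M-parameter model, ~50 bits of entropy per tuple.
print(max_knowledge_tuples(124_000_000, 50))
```

Under these assumed numbers the bound works out to 4,960,000 tuples; the paper's actual estimates come from the information-theoretic entropy of its synthetic knowledge bases.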
Researcher Affiliation: Collaboration. Zeyuan Allen-Zhu, FAIR at Meta, EMAIL; Yuanzhi Li, Mohamed bin Zayed University of AI, EMAIL.
Pseudocode: No. The paper defines theoretical concepts and presents theorems (e.g., Theorem 3.1) but does not include any structured pseudocode or algorithm blocks. It explicitly states, "We omit all technical details in this ICLR 2025 camera-ready version."
Open Source Code: No. "Full and future editions of Part 3.3, including additional experiments and potential code releases, are available at physics.allen-zhu.com and ssrn.com/abstract=5250617."
Open Datasets: Yes. "We generate synthetic knowledge-only datasets by uniformly at random generating (name, attribute, value) tuples from a knowledge base and converting them into English descriptions. We pretrain language models (e.g., GPT-2, LLaMA, Mistral) on these texts using a standard auto-regressive objective from random initialization, and estimate the learned knowledge. By varying the number of knowledge pieces and model sizes, we outline a knowledge capacity scaling law. ... Allen-Zhu & Li (2024) introduced a synthetic biography dataset comprising N randomly-generated (fake) individuals, each characterized by six attributes: birth date, birth city, university, major, employer, and working city."
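The described generation process (uniform sampling of tuples, then templated English) can be sketched in a few lines. This is an assumption-laden toy, not the authors' generator: the names, attribute pools, and sentence template below are hypothetical, and the real bioS datasets draw from far larger value spaces.

```python
import random

ATTR_POOLS = {
    # Hypothetical, tiny value pools; the real datasets use much larger spaces
    # and six attributes per individual.
    "birth city": ["Princeton", "Boston", "Seattle"],
    "major": ["Physics", "Biology", "Music"],
    "employer": ["Meta", "MBZUAI", "NASA"],
}

def make_tuples(n_people: int, seed: int = 0):
    """Uniformly sample (name, attribute, value) tuples for fake individuals."""
    rng = random.Random(seed)
    return [
        (f"Person_{i}", attr, rng.choice(pool))  # placeholder names
        for i in range(n_people)
        for attr, pool in ATTR_POOLS.items()
    ]

def to_english(t):
    """Convert one tuple into a plain English description."""
    name, attr, value = t
    return f"{name}'s {attr} is {value}."

bios = [to_english(t) for t in make_tuples(2)]
```

Each person yields one sentence per attribute; the resulting text is what the models are pretrained on from random initialization.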
Dataset Splits: No. The paper describes how training data is prepared: "Knowledge paragraphs about individuals are randomly concatenated, separated by <EOS> tokens, and then randomly segmented into 512-token windows." However, it does not specify explicit training, validation, or test dataset splits in terms of percentages or sample counts for evaluation.
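The quoted concatenate-and-segment step can be sketched as follows. A minimal sketch under stated assumptions: the `eos_id` default and the shuffle-then-chunk details are illustrative choices, not taken from the paper.

```python
import random

def pack_windows(docs, eos_id=50256, window=512, seed=0):
    """Randomly concatenate tokenized paragraphs, separated by <EOS>,
    then segment the stream into fixed-length token windows (sketch)."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)                      # random concatenation order
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)              # <EOS> separator after each paragraph
    # Non-overlapping windows; a trailing remainder shorter than `window` is dropped.
    return [stream[i:i + window] for i in range(0, len(stream) - window + 1, window)]

# Toy demonstration with a 3-token window instead of 512.
windows = pack_windows([[1, 2], [3, 4]], eos_id=0, window=3)
```

With two 2-token paragraphs plus their separators, the 6-token stream splits into exactly two windows of 3.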
Hardware Specification: Yes. "Training GPT2-20-16 on bioS(10M) for 1000 exposures costs 8.5 days with 64 A100s, while GPT2-12-32 on bioS(20M) for 100 exposures took 2.4 days."
Software Dependencies: No. The paper mentions using "the default AdamW optimizer and mixed-precision fp16" and "The default GPT2Tokenizer is used". It does not provide specific version numbers for software libraries or frameworks such as PyTorch, TensorFlow, or Python.
Experiment Setup: Yes. "We train language models from scratch (i.e., random initialization) using the specified datasets. Knowledge paragraphs about individuals are randomly concatenated, separated by <EOS> tokens, and then randomly segmented into 512-token windows. The standard autoregressive loss is employed for training. Unless specified otherwise, training utilizes the default AdamW optimizer and mixed-precision fp16. Learning rates and weight decays are moderately tuned (see full paper)."
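The "standard autoregressive loss" referenced above is the mean next-token negative log-likelihood. A minimal pure-Python sketch (the toy vocabulary and probabilities are illustrative, not from the paper):

```python
import math

def autoregressive_loss(pred_dists, target_ids):
    """Mean next-token negative log-likelihood, the standard LM pretraining loss.
    pred_dists[t] is the model's predicted probability distribution over the
    vocabulary at step t; target_ids[t] is the true next token."""
    nll = [-math.log(pred_dists[t][target_ids[t]]) for t in range(len(target_ids))]
    return sum(nll) / len(nll)

# Toy vocabulary of 4 tokens, two prediction steps.
dists = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
loss = autoregressive_loss(dists, [0, 2])
```

In practice this loss is computed over the 512-token windows described above, with frameworks handling the log-softmax and batching.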