How Does Critical Batch Size Scale in Pre-training?

Authors: Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression.
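The scaling-law fitting described above can be sketched in a few lines: fit a power law CBS(D) = a · D^b by linear regression in log-log space. The data points below are made-up placeholders for illustration, not the paper's measurements, and the exponent they produce is not the paper's fitted value.

```python
import numpy as np

# Hypothetical data: measured critical batch sizes at several data sizes
# (tokens). These values are illustrative placeholders only.
data_sizes = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
cbs = np.array([256.0, 350.0, 480.0, 660.0, 900.0])

# A power law CBS = a * D**b is linear in log-log space:
# log(CBS) = b * log(D) + log(a), so ordinary least squares recovers b and a.
b, log_a = np.polyfit(np.log(data_sizes), np.log(cbs), 1)
a = np.exp(log_a)
print(f"fitted exponent b = {b:.3f}, scale a = {a:.3g}")
```

With real sweep results, the same two-line fit would be run separately against data size and model size to decouple their effects, as the paper does.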
Researcher Affiliation | Collaboration | Harvard University; Kempner Institute, Harvard University; University of California, Berkeley; The University of Hong Kong; Amazon
Pseudocode | No | The paper describes methods and theoretical analysis in prose, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/hlzhang109/critical-batch-size.
Open Datasets | Yes | We train a series of autoregressive LMs with a context length of 512 in different sizes ranging from 85M, 151M, 302M, 604M to 1.2B (Appendix D Table 2) on C4 (Raffel et al., 2020) using Adam (Kingma, 2014)
Dataset Splits | No | The paper mentions using a "holdout validation set" and discusses "evaluation batches" with their sizes (e.g., 327,680 tokens for evaluation variance), but it does not specify explicit numerical splits (e.g., percentages or exact counts) for the training, validation, or test sets of the C4 dataset.
Hardware Specification | Yes | We use nodes equipped with 8 A100 GPUs, each with 80GiB of memory, for model training.
Software Dependencies | No | The paper mentions using Adam, PyTorch, and the Olmo training suite, but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Table 3: Sweep experiment settings. Default values after the hyper-parameter search are shown in bold; bold entries are defaults that closely reproduce the results without extensive tuning, while unbolded entries were fully swept at each model scale. The table lists specific values for batch size, learning rate, warmup fraction, momentum β1, Adam β2, EWA decay rate τ, context length, and gradient clipping norm.
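A grid sweep over settings like those in Table 3 can be enumerated as a Cartesian product of per-hyper-parameter value lists. The specific values below are illustrative placeholders, not the paper's Table 3 entries.

```python
from itertools import product

# Hypothetical sweep grid in the spirit of Table 3; values are placeholders.
grid = {
    "batch_size": [256, 512, 1024, 2048],
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "warmup_fraction": [0.0, 0.05, 0.1],
}

# Enumerate every combination as a flat list of config dicts.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(configs)} configurations")  # 4 * 3 * 3 = 36
```

In practice, entries the paper marks as bold defaults would be fixed to a single value, shrinking the product to a sweep over only the remaining hyper-parameters at each model scale.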