How Does Critical Batch Size Scale in Pre-training?

Authors: Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression.
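The scaling-law fitting described above can be sketched in a few lines: fit a power law CBS(D) = a · D^b by linear regression in log-log space. The data points below are made-up placeholders for illustration, not the paper's measurements, and the exponent they produce is not the paper's fitted value.

```python
import numpy as np

# Hypothetical data: measured critical batch sizes at several data sizes
# (tokens). These values are illustrative placeholders only.
data_sizes = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
cbs = np.array([256.0, 350.0, 480.0, 660.0, 900.0])

# A power law CBS = a * D**b is linear in log-log space:
# log(CBS) = b * log(D) + log(a), so ordinary least squares recovers b and a.
b, log_a = np.polyfit(np.log(data_sizes), np.log(cbs), 1)
a = np.exp(log_a)
print(f"fitted exponent b = {b:.3f}, scale a = {a:.3g}")
```

With real sweep results, the same two-line fit would be run separately against data size and model size to decouple their effects, as the paper does.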
Researcher Affiliation | Collaboration | Harvard University; Kempner Institute, Harvard University; University of California, Berkeley; The University of Hong Kong; Amazon
Pseudocode | No | The paper describes methods and theoretical analysis in prose, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/hlzhang109/critical-batch-size.
Open Datasets | Yes | We train a series of autoregressive LMs with a context length of 512 in different sizes ranging from 85M, 151M, 302M, 604M to 1.2B (Appendix D Table 2) on C4 (Raffel et al., 2020) using Adam (Kingma, 2014)
Dataset Splits | No | The paper mentions using a "holdout validation set" and discusses "evaluation batches" with their sizes (e.g., 327,680 tokens for evaluation variance), but it does not specify explicit numerical splits (e.g., percentages or exact counts) for the training, validation, or test sets of the C4 dataset.
Hardware Specification | Yes | We use nodes equipped with 8 A100 GPUs, each with 80GiB of memory, for model training.
Software Dependencies | No | The paper mentions using Adam, PyTorch, and the Olmo training suite, but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Table 3: Sweep experiment settings. Default values after the hyper-parameter search are shown in bold; bold entries are defaults that closely reproduce the results without extensive tuning, while unbolded entries were fully swept at each model scale. The table lists specific values for batch size, learning rate, warmup fraction, momentum β1, Adam β2, EWA decay rate τ, context length, and gradient clipping norm.
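A grid sweep over settings like those in Table 3 can be enumerated as a Cartesian product of per-hyper-parameter value lists. The specific values below are illustrative placeholders, not the paper's Table 3 entries.

```python
from itertools import product

# Hypothetical sweep grid in the spirit of Table 3; values are placeholders.
grid = {
    "batch_size": [256, 512, 1024, 2048],
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "warmup_fraction": [0.0, 0.05, 0.1],
}

# Enumerate every combination as a flat list of config dicts.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(configs)} configurations")  # 4 * 3 * 3 = 36
```

In practice, entries the paper marks as bold defaults would be fixed to a single value, shrinking the product to a sweep over only the remaining hyper-parameters at each model scale.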