Scaling Laws for Precision

Authors: Tanishq Kumar, Zachary Ankner, Benjamin Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens. Our experiments consist of a sweep of language model pretraining runs over N ∈ {30, 60, 110, 220} million parameters (non-embedding) and D ∈ {1.5, 3, 6, 13, 26} billion tokens.
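The (N, D) sweep described above can be sketched as a simple grid enumeration; the helper name and config format are illustrative assumptions, not code from the paper:

```python
from itertools import product

# Grid from the paper: non-embedding parameters N (millions) and tokens D (billions)
N_VALUES_M = [30, 60, 110, 220]
D_VALUES_B = [1.5, 3, 6, 13, 26]

def sweep_configs():
    """Enumerate every (N, D) pretraining configuration in the sweep."""
    return [{"params_millions": n, "tokens_billions": d}
            for n, d in product(N_VALUES_M, D_VALUES_B)]

configs = sweep_configs()
print(len(configs))  # 20 (N, D) combinations; the paper launches 20+ runs per combination
```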
Researcher Affiliation | Collaboration | Harvard University, Stanford University, MIT, Databricks, Carnegie Mellon University
Pseudocode | No | The paper describes methods and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedures are described in paragraph form.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any repository links.
Open Datasets | Yes | We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset (Groeneveld et al., 2024; Soldaini et al., 2024).
Dataset Splits | No | We launch over 20 runs for each (N, D) combination to study scaling in precision, trained and validated on the common crawl split of the Dolma dataset (Soldaini et al., 2024). The paper refers to a "common crawl split" but does not provide specific percentages or counts for training, validation, or test sets.
Hardware Specification | Yes | We train all our models with fake (simulated) quantization on NVIDIA H100 GPUs to remain hardware agnostic, not taking advantage of any true low-precision computation.
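Fake (simulated) quantization of this kind rounds values to a low-precision grid but keeps them stored in full precision, so it runs on any hardware. A minimal sketch, assuming a symmetric per-tensor scheme (the paper does not specify the exact quantizer):

```python
import numpy as np

def fake_quantize(x, bits):
    """Quantize-dequantize: snap x to a signed `bits`-wide integer grid,
    then return the values in full precision (no real low-precision compute).
    Symmetric per-tensor scaling is an illustrative assumption."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = np.max(np.abs(x)) / qmax    # map the largest magnitude onto the grid edge
    if scale == 0:
        return x.copy()                 # all-zero tensor quantizes to itself
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                    # dequantize back to float
```

In a quantization-aware-training setup, weights pass through such a quantize-dequantize in the forward pass while the optimizer updates full-precision master copies.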
Software Dependencies | No | The paper mentions various components and optimizers such as SwiGLU activations, RoPE embeddings, RMSNorm, and Adam with specific beta and epsilon values. However, it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | We use a standard Transformer++ implementation: SwiGLU activations (Shazeer, 2020), RoPE embeddings (Su et al., 2021), RMSNorm, and Adam β values of (0.9, 0.95). We adopt a cosine learning rate schedule with a 10% warmup period and a peak learning rate of 6e-4 for the smallest model, with learning rates scaled with width and depth according to depth-µP for the larger models (Yang et al., 2022; Bordelon et al., 2023). We use a sequence length of 1024 and batch size of 256 throughout, with Adam ϵ = 1e-15, following Wortsman et al. (2023b). We use weight decay of 0.1...
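The cosine schedule with 10% linear warmup and a 6e-4 peak described above can be sketched as follows; the final learning rate `min_lr` is an assumption, since the paper does not state where the cosine decays to:

```python
import math

def lr_at_step(step, total_steps, peak_lr=6e-4, warmup_frac=0.10, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup over the first
    warmup_frac of training, peaking at peak_lr (6e-4 for the smallest model)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak learning rate
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Larger models would further rescale the peak with width and depth per depth-µP, which this sketch omits.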