Scaling Laws for Precision
Authors: Tanishq Kumar, Zachary Ankner, Benjamin Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens. Our experiments consist of a sweep of language model pretraining runs over N ∈ {30, 60, 110, 220} million parameters (non-embedding) and D ∈ {1.5, 3, 6, 13, 26} billion tokens. |
| Researcher Affiliation | Collaboration | 1Harvard University 2Stanford University 3MIT 4Databricks 5Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedures are described in paragraph form. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any repository links. |
| Open Datasets | Yes | We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset (Groeneveld et al., 2024; Soldaini et al., 2024) |
| Dataset Splits | No | We launch over 20 runs for each (N, D) combination to study scaling in precision, trained and validated on the common crawl split of the Dolma dataset (Soldaini et al., 2024). The paper refers to a 'common crawl split' but does not provide specific percentages or counts for training, validation, or test sets. |
| Hardware Specification | Yes | We train all our models with fake (simulated) quantization on NVIDIA H100 GPUs to remain hardware agnostic, not taking advantage of any true low-precision computation. |
| Software Dependencies | No | The paper mentions various components and optimizers such as SwiGLU activations, RoPE embeddings, RMSLayerNorm, and Adam with specific beta and epsilon values. However, it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We use a standard Transformer++ implementation: SwiGLU activations (Shazeer, 2020), RoPE embeddings (Su et al., 2021), RMSLayerNorm, Adam β values of (0.9, 0.95). We adopt a cosine learning rate schedule with 10% warmup period and peak learning rate of 6e-4 for the smallest model and learning rates scaled with width and depth according to depth-µP for the larger models (Yang et al., 2022; Bordelon et al., 2023). We use a sequence length of 1024 and batch size of 256 throughout, with Adam ϵ of 1e-15, following Wortsman et al. (2023b). We use weight decay of 0.1... |
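The hardware row notes that the models were trained with fake (simulated) quantization, i.e., values are round-tripped through a low-precision grid while all arithmetic stays in full precision. A minimal sketch of symmetric absmax fake quantization is below; the function name, the absmax scaling choice, and the symmetric integer range are illustrative assumptions, since the paper's exact quantizer configuration is not quoted here.

```python
def fake_quantize(x, bits=8):
    """Simulate low-precision storage: snap each value onto a (2**bits)-level
    symmetric integer grid, then dequantize back to float. The returned values
    are full-precision floats that carry low-precision rounding error."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit symmetric
    amax = max(abs(v) for v in x) or 1.0  # absmax of the tensor (assumed scheme)
    scale = amax / qmax                   # one scale per tensor
    # quantize: rescale, round to the nearest integer, clip to the valid range
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    # dequantize: values now lie exactly on the low-precision grid
    return [qi * scale for qi in q]
```

Because the round trip happens in software, the same run can emulate any bit width on hardware (here, H100s) that has no native support for it, which is what makes the sweep hardware-agnostic.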
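The experiment-setup row describes a cosine learning rate schedule with a 10% linear warmup and a 6e-4 peak for the smallest model. A small sketch of that schedule is below; the function name and the `min_lr` floor (defaulting to 0) are assumptions, as the paper excerpt does not state a final learning rate.

```python
import math

def lr_at_step(step, total_steps, peak_lr=6e-4, warmup_frac=0.10, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup.

    For the first warmup_frac of training, the rate climbs linearly to
    peak_lr; afterwards it follows a half-cosine decay toward min_lr.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    # cosine decay over the remaining 90% of training
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the larger models the review notes the peak rate is not fixed but scaled with width and depth via depth-µP, so `peak_lr` would be a per-model value rather than a constant.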