Scaling Laws for Precision
Authors: Tanishq Kumar, Zachary Ankner, Benjamin Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens. Our experiments consist of a sweep of language model pretraining runs over N ∈ {30, 60, 110, 220} million parameters (non-embedding) and D ∈ {1.5, 3, 6, 13, 26} billion tokens. |
| Researcher Affiliation | Collaboration | 1Harvard University 2Stanford University 3MIT 4Databricks 5Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedures are described in paragraph form. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any repository links. |
| Open Datasets | Yes | We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset (Groeneveld et al., 2024; Soldaini et al., 2024) |
| Dataset Splits | No | We launch over 20 runs for each (N, D) combination to study scaling in precision, trained and validated on the common crawl split of the Dolma dataset (Soldaini et al., 2024). The paper refers to a 'common crawl split' but does not provide specific percentages or counts for training, validation, or test sets. |
| Hardware Specification | Yes | We train all our models with fake (simulated) quantization on NVIDIA H100 GPUs to remain hardware agnostic, not taking advantage of any true low-precision computation. |
| Software Dependencies | No | The paper mentions various components and optimizers such as SwiGLU activations, RoPE embeddings, RMSLayerNorm, and Adam with specific beta and epsilon values. However, it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We use a standard Transformer++ implementation: SwiGLU activations (Shazeer, 2020), RoPE embeddings (Su et al., 2021), RMSLayerNorm, Adam β values of (0.9, 0.95). We adopt a cosine learning rate schedule with 10% warmup period and peak learning rate of 6e-4 for the smallest model and learning rates scaled with width and depth according to depth-µP for the larger models (Yang et al., 2022; Bordelon et al., 2023). We use a sequence length of 1024 and batch size of 256 throughout, with Adam ϵ of 1e-15, following Wortsman et al. (2023b). We use weight decay of 0.1... |
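The hardware row notes that the models were trained with fake (simulated) quantization, i.e., values are round-tripped through a low-precision grid while all arithmetic stays in full precision. A minimal sketch of symmetric absmax fake quantization is below; the function name, the absmax scaling choice, and the symmetric integer range are illustrative assumptions, since the paper's exact quantizer configuration is not quoted here.

```python
def fake_quantize(x, bits=8):
    """Simulate low-precision storage: snap each value onto a (2**bits)-level
    symmetric integer grid, then dequantize back to float. The returned values
    are full-precision floats that carry low-precision rounding error."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit symmetric
    amax = max(abs(v) for v in x) or 1.0  # absmax of the tensor (assumed scheme)
    scale = amax / qmax                   # one scale per tensor
    # quantize: rescale, round to the nearest integer, clip to the valid range
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    # dequantize: values now lie exactly on the low-precision grid
    return [qi * scale for qi in q]
```

Because the round trip happens in software, the same run can emulate any bit width on hardware (here, H100s) that has no native support for it, which is what makes the sweep hardware-agnostic.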
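The experiment-setup row describes a cosine learning rate schedule with a 10% linear warmup and a 6e-4 peak for the smallest model. A small sketch of that schedule is below; the function name and the `min_lr` floor (defaulting to 0) are assumptions, as the paper excerpt does not state a final learning rate.

```python
import math

def lr_at_step(step, total_steps, peak_lr=6e-4, warmup_frac=0.10, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup.

    For the first warmup_frac of training, the rate climbs linearly to
    peak_lr; afterwards it follows a half-cosine decay toward min_lr.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    # cosine decay over the remaining 90% of training
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the larger models the review notes the peak rate is not fixed but scaled with width and depth via depth-µP, so `peak_lr` would be a per-model value rather than a constant.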