Scaling Laws for Differentially Private Language Models
Authors: Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, Georgios Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a rigorous set of experiments, we empirically model this trade-off, and provide a thorough analysis of these experimental results to answer a number of scaling law-style questions, finding (among other things) that: The compute budget allocation predicted by non-private scaling laws is far from optimal under DP, even for huge privacy budgets, confirming the need for our study. |
| Researcher Affiliation | Industry | 1 Google Research, 2 Google DeepMind. Correspondence to: Ryan McKenna <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Informal) Generalized DP-SGD; Appendix B.1 discusses the informalities. Input: dataset D, noise-batch ratio σ, (expected) batch size B, iterations T. Output: model parameters θ. Initialize model parameters θ0 ∈ R^M. |
| Open Source Code | No | The paper references third-party libraries such as the dp_accounting library (Google DP Team, 2022) and NanoDO (Liu et al., 2024), and frameworks such as JAX, but it neither links to source code developed by the authors for the paper's methodology nor makes an unambiguous statement of its release. |
| Open Datasets | Yes | Models and Datasets. We train BERT models ranging in scale from Tiny (4M parameters) to Mega (778M parameters), summarized in Table 1. We focus on the default BERT dataset, which includes approximately 3.3B words (Zhu et al., 2015; Devlin et al., 2019) before tokenization. |
| Dataset Splits | No | The paper reports training loss and notes that evaluation loss is approximated by training loss because models train for less than a single epoch. It states that 'Each example is truncated or padded as necessary to a sequence of fixed length S = 512' and that 'We measured the loss of the final trained model on 1M examples from the training set.' However, it does not specify distinct training, validation, or test splits, their percentages, or how to derive them for reproduction. |
| Hardware Specification | Yes | We utilize TPUv3 pods to run all experiments, and configured the models to use pure data parallelism, using more cores for larger models so that each experiment finishes within four to ten hours. Bert Tiny was trained on 16 TPUv3 cores, while Bert Large was trained on 128. Table 4 provides the training throughputs for all models in our experiments. |
| Software Dependencies | Yes | The dp_accounting library provides functions that can efficiently and tightly compute the minimum value of σ as a function of ϵ, δ, N, and B (Google DP Team, 2022). We compute the noise-batch ratio for different settings by using the dp_accounting library (Google DP Team, 2022). |
| Experiment Setup | Yes | Optimizer. We use DP-Adam throughout. We use 1000 steps of learning rate warm-up, followed by exponential learning rate decay, decreasing the learning rate by a factor of 10 over a horizon of 128K iterations. We use per-example clipping with an ℓ2 clip norm of 1.0 across all experiments. Learning Rates. We tune the learning rate with per-example gradient clipping but no noise, finding that the optimal learning rate is consistently 2^-7 across all model scales. With noise, we consider three learning rates: 2^-7, 2^-8, 2^-9. Batch Sizes. We use a fixed physical batch size of 1024 across all experiments. Noise-Batch Ratio. We consider 18 values of noise-batch ratio: {2^-k \| k = 6, ..., 23}, plus a baseline value of 0 corresponding to non-private training. Metrics. Every 100 training iterations, we record the average training loss over the previous 100 iterations (or 102,400 training examples). |