Scaling Laws for Differentially Private Language Models
Authors: Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, Georgios Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a rigorous set of experiments, we empirically model this trade-off, and provide a thorough analysis of these experimental results to answer a number of scaling law-style questions, finding (among other things) that: The compute budget allocation predicted by non-private scaling laws is far from optimal under DP, even for huge privacy budgets, confirming the need for our study. |
| Researcher Affiliation | Industry | 1 Google Research, 2 Google DeepMind. Correspondence to: Ryan McKenna <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Informal) Generalized DP-SGD; Appendix B.1 discusses the informalities. Input: dataset D, noise-batch ratio σ, (expected) batch size B, iterations T. Output: model parameters θ. Initialize model parameters θ0 ∈ R^M. |
| Open Source Code | No | The paper references third-party libraries such as the dp_accounting library (Google DP Team, 2022) and NanoDO (Liu et al., 2024), and frameworks such as JAX, but it neither links to source code developed by the authors for the paper's methodology nor makes an unambiguous statement of its release. |
| Open Datasets | Yes | Models and Datasets. We train BERT models ranging in scale from Tiny (4M parameters) to Mega (778M parameters), summarized in Table 1. We focus on the default BERT dataset, which includes approximately 3.3B words (Zhu et al., 2015; Devlin et al., 2019) before tokenization. |
| Dataset Splits | No | The paper reports training loss and notes that evaluation loss is approximated by training loss because models train for less than a single epoch. It states that 'Each example is truncated or padded as necessary to a sequence of fixed length S = 512' and that 'We measured the loss of the final trained model on 1M examples from the training set.' However, it does not specify distinct training, validation, or test splits, their percentages, or how to derive them for reproduction. |
| Hardware Specification | Yes | We utilize TPUv3 pods to run all experiments, and configured the models to use pure data parallelism, using more cores for larger models so that each experiment finishes within four to ten hours. Bert Tiny was trained on 16 TPUv3 cores, while Bert Large was trained on 128. Table 4 provides the training throughputs for all models in our experiments. |
| Software Dependencies | Yes | The dp_accounting library provides functions that can efficiently and tightly compute the minimum value of σ as a function of ϵ, δ, N, and B (Google DP Team, 2022). We compute the noise-batch ratio for different settings by using the dp_accounting library (Google DP Team, 2022). |
| Experiment Setup | Yes | Optimizer. We use DP-Adam throughout. We use 1000 steps of learning rate warm-up, followed by exponential learning rate decay, decreasing the learning rate by a factor of 10 over a horizon of 128K iterations. We use per-example clipping with an ℓ2 clip norm of 1.0 across all experiments. Learning Rates. We tune the learning rate with per-example gradient clipping but no noise, finding that the optimal learning rate is consistently 2^-7 across all model scales. With noise, we consider three learning rates: 2^-7, 2^-8, 2^-9. Batch Sizes. We use a fixed physical batch size of 1024 across all experiments. Noise-Batch Ratio. We consider 18 values of noise-batch ratio: {2^-k \| k = 6, ..., 23}, plus a baseline value of 0 corresponding to non-private training. Metrics. Every 100 training iterations, we record the average training loss over the previous 100 iterations (or 102,400 training examples). |