Empirical Privacy Variance

Authors: Yuzheng Hu, Fan Wu, Ruicheng Xian, Yuhang Liu, Lydia Zakynthinou, Pritish Kamath, Chiyuan Zhang, David Forsyth

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful.
Researcher Affiliation | Collaboration | 1 Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; 2 Institute of Automation, Chinese Academy of Sciences; 3 Department of Electrical Engineering and Computer Science, University of California, Berkeley; 4 Google Research. Correspondence to: Yuzheng Hu <EMAIL>, Fan Wu <EMAIL>.
Pseudocode | Yes | A. DP-SGD and DP-Adam: For completeness, we offer a full description of DP-SGD and DP-Adam in Alg. 2 and Alg. 3. We note that our implementation uses shuffling-based samplers instead of Poisson subsampling. Algorithm 2 Differentially Private Stochastic Gradient Descent (DP-SGD) (Abadi et al., 2016) ... Algorithm 3 DP-Adam (Li et al., 2022) ... Algorithm 4 Adam Update (Kingma & Ba, 2015)
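The per-example clip-and-noise step at the core of Alg. 2 can be sketched as follows. This is a minimal NumPy illustration under standard DP-SGD assumptions, not the authors' implementation; the function name and hyperparameter values are placeholders.

```python
import numpy as np

def dp_sgd_step(per_example_grads, c, sigma, eta, params, rng):
    """One DP-SGD update: clip each per-example gradient to L2 norm c,
    sum, add Gaussian noise with scale sigma * c, average, then step."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, c / max(norm, 1e-12)))  # per-example clipping
    b = len(clipped)  # batch size
    noisy_mean = (np.sum(clipped, axis=0)
                  + rng.normal(0.0, sigma * c, size=params.shape)) / b
    return params - eta * noisy_mean  # gradient descent step
```

With the noise multiplier sigma set to zero, the update reduces to plain SGD on clipped gradients, so its magnitude is bounded by eta * c regardless of the raw gradient norms.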
Open Source Code | Yes | Our code is publicly available at https://github.com/empvv/empirical-privacy-variance.
Open Datasets | Yes | Our experimental framework consists of two main steps: 1) fine-tuning an LLM on a dataset using DP-SGD, and 2) evaluating the empirical privacy (formally defined shortly) of the resulting model. We base our study on two sets of experiments. In the first, we fine-tune GPT-2 models (-small (S) and -large (L); Radford et al., 2019) on Enron Email (Cohen, 2004). In the second, we fine-tune Llama-2 models (-7b and -13b; Touvron et al., 2023) on TOFU (Maini et al., 2024).
Dataset Splits | Yes | Step 4: We split the dataset into train, validation, and test sets. We extract a list of secrets from the training set (see Appendix B.6) and then filter out samples in the validation/test sets that contain secret strings as substrings. The resulting final train/validation/test sizes are 33,508/2,725/1,279. ... Train/test split: We partition the dataset into train and test sets by stratifying and splitting at the author level: we allocate 90% of the authors (i.e., samples [0,3600)) to the train set and the remaining 10% (i.e., samples [3600,4000)) to the test set, ensuring that the two sets contain non-overlapping author identities.
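The author-level partition described above can be sketched as a simple index cutoff. This is an illustration of the stated 90/10 boundary at author index 3600, not the paper's preprocessing code; `records` is a hypothetical list of (author_id, text) pairs.

```python
def author_level_split(records, n_authors=4000, train_frac=0.9):
    """Partition records by author id so train and test share no authors."""
    boundary = int(n_authors * train_frac)  # authors [0, 3600) go to train
    train = [r for r in records if r[0] < boundary]
    test = [r for r in records if r[0] >= boundary]
    return train, test
```

Splitting on the author id rather than on individual samples is what guarantees the two sets contain non-overlapping author identities.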
Hardware Specification | Yes | We run our experiments on three main computing environments. The first setup has four NVIDIA H100 GPUs (each with 80GB of HBM3 memory) and an Intel Xeon Platinum 8468 CPU (192 cores). The second setup has four NVIDIA A100 GPUs (each with 80GB) and an AMD EPYC 7643 CPU (192 cores). For larger-scale experiments, we use a cluster consisting of 100 nodes, where each node contains four 40GB A100 GPUs.
Software Dependencies | Yes | We use lm() in R Statistical Software (v4.4.2) (R Core Team, 2024) to perform multivariate regression, where the target y is the empirical privacy score and the covariates are the hyperparameters b, T, η.
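The quoted lm() fit of y ~ b + T + η is an ordinary least-squares regression, which can be sketched in Python with NumPy standing in for R. The function name is a placeholder and the data in the usage example would be synthetic, not the paper's measurements.

```python
import numpy as np

def fit_privacy_regression(b, T, eta, y):
    """OLS fit of the empirical privacy score y on hyperparameters
    (b, T, eta), mirroring y ~ b + T + eta in R's lm()."""
    # Design matrix with an intercept column, as lm() adds by default.
    X = np.column_stack([np.ones_like(b, dtype=float), b, T, eta])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # intercept, then slopes for b, T, eta
```

On data generated exactly from a linear model, the fit recovers the generating coefficients, which is a quick sanity check of the design matrix construction.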
Experiment Setup | Yes | DP-SGD (Abadi et al., 2016) is the go-to algorithm for achieving DP in deep learning and has been applied across diverse applications (De et al., 2022; Yu et al., 2022; Xu et al., 2023; Hu et al., 2024). It involves the following training hyperparameters: b (batch size), T (number of training iterations), η (learning rate), c (clipping norm). ... We perform extensive hyperparameter tuning in the space of (b, T, η), while fixing c to a small constant, as we find that varying it within the recommended range (Li et al., 2022; De et al., 2022) has minimal impact on utility or empirical privacy.
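The tuning space described above, sweeping (b, T, η) with the clipping norm c held fixed, can be sketched as a grid enumeration. The candidate values below are placeholders, not the paper's grid.

```python
from itertools import product

def hyperparameter_grid(batch_sizes, iteration_counts, learning_rates, c=0.1):
    """Enumerate (b, T, eta) configurations with the clipping norm c fixed,
    as in the tuning setup described in the text."""
    return [{"b": b, "T": T, "eta": eta, "c": c}
            for b, T, eta in product(batch_sizes, iteration_counts, learning_rates)]
```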