Private Fine-tuning of Large Language Models with Zeroth-order Optimization
Authors: Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, Prateek Mittal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first overview our experimental setup in Section 4.1 and then evaluate the performance of DP-ZO in Section 4.2. We find that DP-ZO provides a competitive privacy-utility trade-off across multiple datasets and model architectures, and can scale to large models under conservative privacy budgets. We also compare DP-ZO to DP-SGD in Section 4.2 and show that DP-ZO achieves comparable performance to DP-SGD for the same model size. Furthermore, we show that DP-ZO achieves a non-trivial privacy-utility trade-off under pure ε-DP with a conservative privacy budget such as ε = 4 on large language models. In Section 4.3 we first provide results of DP-ZO across different model architectures. We then measure the empirical privacy loss and computational efficiency of DP-ZO. |
| Researcher Affiliation | Collaboration | Xinyu Tang¹, Ashwinee Panda¹, Milad Nasr², Saeed Mahloujifar³, Prateek Mittal¹. ¹Princeton University, ²Google DeepMind, ³FAIR, Meta |
| Pseudocode | Yes | Algorithm 1: Differentially Private-ZO. Algorithm 2: Differentially Private-ZO (GPU-memory-efficient version, adapted from Malladi et al. (2023)) |
| Open Source Code | No | The paper does not provide a direct link to the authors' own implementation of DP-ZO or an explicit statement of code release for the described methodology. It mentions that "Our experiments are based on the open-source code of Malladi et al. (2023)", which refers to a third-party open-source project (MeZO), not the specific code for this paper's contribution. |
| Open Datasets | Yes | Datasets. Following Malladi et al. (2023), we mainly consider three different benchmark NLP tasks: SQuAD (Rajpurkar et al., 2016) and DROP (Dua et al., 2019) for text generation, and SST2 (Socher et al., 2013) for text classification. We use F1 for text generation and accuracy for text classification as evaluation metrics (we include a detailed description of the metrics in Appendix D.1). ... In Section 4.3, we follow prior works (Yu et al., 2022; Li et al., 2022b) and run experiments on QNLI (Wang et al., 2019). |
| Dataset Splits | Yes | Although all these datasets have very different sizes, we consider the few-shot setting for all of them, sampling 1000 points for each dataset. ... We also vary the training sample size from the few-shot setting to the full training set by conducting experiments on the QNLI (Wang et al., 2019) dataset, to be consistent with previous works (Li et al., 2022b; Yu et al., 2022) for a fair comparison. ... Table 7: The effect of different numbers n of training samples for DP-ZO (Gaussian) on the SQuAD dataset. (1, 10⁻⁵)-DP. OPT-13B with LoRA fine-tuning. n-shot: n = 250, 500, 1000, 5000. ... Table 9: Comparison of DP-ZO and DP-SGD for different n in QNLI on the RoBERTa-base model. (3, 4.7×10⁻⁶)-DP. n-shot: 1000, 5000, 10000, 50000, 104743. For the full set, n = 104743 for QNLI and n = 66849 for SST2. |
| Hardware Specification | Yes | The results of DP-SGD on DROP are omitted because fine-tuning OPT-13B on the DROP dataset with LoRA causes an out-of-memory issue on a single A100 GPU even in the non-private setting. ... OOM indicates out-of-memory on a single A100 80G GPU. |
| Software Dependencies | No | The paper mentions various software components and libraries, such as 'PyTorch FSDP', 'Opacus', 'fastDP', 'Microsoft (2021)' (referring to the prv_accountant library), and 'Google's DP library (2020)', but it does not specify the version numbers of these ancillary software components as used in the experimental setup. |
| Experiment Setup | Yes | We detail the full hyper-parameter searches and computation cost in Appendix D.2. ... Table 16: Hyper-parameter search for DP-ZO in main results Table 1. \|D\| = 1000; steps T = 75000; clipping C = 0.05; batch size = 16; σ = 30.9 for ε = 0.5, 16.4 for ε = 1, 4.8 for ε = 4; learning rate ∈ [5e-6, 1e-5, 2e-5, 5e-5, 1e-4]; LoRA rank = 8; φ = 0.01 |
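To make the reported pseudocode and hyper-parameters (clipping C, noise multiplier σ, perturbation scale φ) concrete, here is a minimal sketch of one DP-ZO (Gaussian) update. It assumes the MeZO-style two-point (SPSA) gradient estimate described in the paper: each example contributes a scalar projected gradient, which is clipped and noised before the shared random direction is applied. The function name `dp_zo_step` and the per-example loss interface are illustrative, not the authors' code.

```python
import numpy as np

def dp_zo_step(theta, losses, rng, phi=0.01, clip=0.05, sigma=16.4, lr=1e-5):
    """One sketched DP-ZO (Gaussian) update.

    theta:  flat parameter vector (np.ndarray)
    losses: list of per-example loss functions f(theta) -> float
    phi:    perturbation scale (paper's φ = 0.01)
    clip:   per-example clipping threshold (paper's C = 0.05)
    sigma:  noise multiplier (paper reports 16.4 for ε = 1)
    """
    # Shared random perturbation direction for the whole batch.
    z = rng.standard_normal(theta.shape)
    # Scalar projected gradient per example: two forward passes each.
    g = np.array([(f(theta + phi * z) - f(theta - phi * z)) / (2 * phi)
                  for f in losses])
    # Clip each scalar to [-C, C], sum, and add Gaussian noise σ·C.
    noisy_sum = np.clip(g, -clip, clip).sum() + sigma * clip * rng.standard_normal()
    # Average over the batch, then step along the shared direction.
    return theta - lr * (noisy_sum / len(losses)) * z
```

Because the privatized quantity is a single scalar per example rather than a full gradient, the memory footprint matches non-private zeroth-order fine-tuning, which is what lets the method scale to OPT-13B on one A100.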