Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs
Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Google Research 3Mohamed bin Zayed University of Artificial Intelligence 4UC San Diego. Correspondence to: Bowen Tan, Shanshan Wu <EMAIL, EMAIL>. |
| Pseudocode | Yes | H. Pseudo-code. We provide pseudo-code of our framework corresponding to Section 3 and Figure 1. Algorithm 1 A. Pretraining with Public Corpora CTCL-Topic Development ... Algorithm 2 B. Learning the Private Domain Private Topic Histogram Construction ... Algorithm 3 C. Data Synthesis |
| Open Source Code | Yes | Code available at https://github.com/tanyuqian/synthetic-private-data |
| Open Datasets | Yes | Both components are pre-trained on the large-scale public corpora, SlimPajama (Soboleva et al., 2023) and Wikipedia (Foundation, 2023), respectively. ... Specifically, we include PubMed (Yu et al., 2023) to represent the academic medical domain, Chatbot Arena (Zheng et al., 2023) for human-to-machine interactions, and Multi-Session Chat (Xu, 2021) for human-to-human everyday dialogues. ... We conduct experiments on two classification tasks: Yelp (Yelp, Inc.) and OpenReview (Xie et al., 2024) |
| Dataset Splits | Yes | C. Dataset Sizes (train / valid / test): PubMed 75,316 / 14,423 / 4,453; Chatbot Arena 180,000 / 5,000 / 3,819; Multi-Session Chat 17,940 / 3,000 / 2,505; Yelp 1,939,290 / 5,000 / 5,000; OpenReview 8,396 / 2,798 / 2,798. |
| Hardware Specification | Yes | The implementation of the pretraining is based on RedCoast (Tan et al., 2024) using bf16 mixed precision, and the pretraining takes approximately 24 hours on 256 TPU-v4 cores (Jouppi et al., 2023). |
| Software Dependencies | Yes | The implementation of the pipeline above is based on BERTopic (Grootendorst, 2022). ... we use DP-Adam for DP finetuning and follow the standard Gaussian mechanism to obtain an (ϵ, δ)-DP guarantee. Compared to the Vanilla DP Finetune approach, the noise multiplier used by our method is slightly larger, because we need to allocate a small portion of the privacy budget to the DP topic histogram (see Appendix D). Besides, GPT2-XL (1.5B) has much smaller noise multipliers because we reduce the training batch size from 4096 to 256 to save computational resources. For other non-DP training hyperparameters, see Section 4.1.3. ... the standard dp_accounting package (DP Team, 2022). |
| Experiment Setup | Yes | For all settings involving DP finetuning, we use DP-Adam for 2000 steps with a batch size of 4096, a gradient norm clip of 1.0, and a weight decay of 0.1. The learning rate follows a linear decay schedule with 100 warmup steps, and the peak learning rate is selected from {1, 4} × {10⁻³, 10⁻⁴, 10⁻⁵} based on validation performance. The privacy budget accounts for both DP model finetuning and the collection of DP topic histogram statistics. We apply a Gaussian noise multiplier of 10 to the DP topic histogram. ... For the sample generation process, we generate 400K synthetic examples using nucleus sampling with top-p = 0.95 and a maximum sequence length of 512 tokens. ... For generative tasks, we train the causal versions of BERT-Mini and BERT-Small using a linear learning rate schedule from 0.0003 to 0, a batch size of 64, and a total of 6000 steps, with a weight decay of 0.01. For classification tasks, we finetune a RoBERTa-base model under the same hyperparameter settings as in the generative tasks above, except for a learning rate of 3 × 10⁻⁵. |
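The DP finetuning rows above mention DP-Adam with a gradient norm clip of 1.0 and a Gaussian noise multiplier. A minimal NumPy sketch of the per-example clip-and-noise step that produces the noisy gradient fed to the optimizer is shown below; `dp_adam_noisy_grad` is a hypothetical helper name, and the clip norm and noise multiplier values are the ones quoted from the paper, not an exact reproduction of its implementation.

```python
import numpy as np

def dp_adam_noisy_grad(per_example_grads, clip_norm=1.0,
                       noise_multiplier=1.0, rng=None):
    """Clip each example's gradient to clip_norm, sum, add Gaussian
    noise scaled by noise_multiplier * clip_norm, and average.
    This is the standard DP-SGD/DP-Adam gradient sanitization step."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down only when the per-example norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)
```

With `noise_multiplier=0` the function reduces to the mean of clipped gradients, which is a convenient sanity check.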
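The setup also allocates part of the privacy budget to a DP topic histogram with a Gaussian noise multiplier of 10. A sketch of that statistic, assuming each private document contributes one count to its topic bin (sensitivity 1) and that negative noisy counts are clipped before renormalization (a common post-processing choice, not confirmed by the paper):

```python
import numpy as np

def dp_topic_histogram(topic_ids, num_topics, noise_multiplier=10.0, rng=None):
    """Count documents per topic, add Gaussian noise with
    std = noise_multiplier (sensitivity 1 per document), then clip
    negatives and renormalize into a probability distribution."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    noisy = counts + rng.normal(0.0, noise_multiplier, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)  # post-processing preserves DP
    return noisy / noisy.sum()
```

The resulting distribution is what steers data synthesis toward the private domain's topic mix.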
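Finally, the 400K synthetic examples are generated with nucleus sampling at top-p = 0.95. A minimal sketch of the top-p filtering step over a next-token distribution (decoder loop and tokenizer omitted):

```python
import numpy as np

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize (nucleus sampling)."""
    order = np.argsort(probs)[::-1]          # tokens by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # include token crossing top_p
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    filtered = probs * mask
    return filtered / filtered.sum()
```

Sampling then draws the next token from the renormalized distribution, which truncates the low-probability tail while preserving diversity among likely tokens.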