Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs
Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Google Research 3Mohamed bin Zayed University of Artificial Intelligence 4UC San Diego. Correspondence to: Bowen Tan, Shanshan Wu <EMAIL, EMAIL>. |
| Pseudocode | Yes | H. Pseudo-code. We provide pseudo-code of our framework corresponding to Section 3 and Figure 1. Algorithm 1 A. Pretraining with Public Corpora CTCL-Topic Development ... Algorithm 2 B. Learning the Private Domain Private Topic Histogram Construction ... Algorithm 3 C. Data Synthesis |
| Open Source Code | Yes | Code available at https://github.com/tanyuqian/synthetic-private-data |
| Open Datasets | Yes | Both components are pre-trained on the large-scale public corpora, SlimPajama (Soboleva et al., 2023) and Wikipedia (Foundation, 2023), respectively. ... Specifically, we include PubMed (Yu et al., 2023) to represent the academic medical domain, Chatbot Arena (Zheng et al., 2023) for human-to-machine interactions, and Multi-Session Chat (Xu, 2021) for human-to-human everyday dialogues. ... We conduct experiments on two classification tasks: Yelp (Yelp, Inc.) and OpenReview (Xie et al., 2024) |
| Dataset Splits | Yes | C. Dataset Sizes (train / valid / test): PubMed 75,316 / 14,423 / 4,453; Chatbot Arena 180,000 / 5,000 / 3,819; Multi-Session Chat 17,940 / 3,000 / 2,505; Yelp 1,939,290 / 5,000 / 5,000; OpenReview 8,396 / 2,798 / 2,798. |
| Hardware Specification | Yes | The implementation of the pretraining is based on RedCoast (Tan et al., 2024) using bf16 mixed precision, and the pretraining takes approximately 24 hours on 256 TPU-v4 cores (Jouppi et al., 2023). |
| Software Dependencies | Yes | The implementation of the pipeline above is based on BERTopic (Grootendorst, 2022). ... we use DP-Adam for DP finetuning and follow the standard Gaussian mechanism to obtain an (ϵ, δ)-DP guarantee. Compared to the Vanilla DP Finetune approach, the noise multiplier used by our method is slightly larger, because we need to allocate a small portion of the privacy budget to the DP topic histogram (see Appendix D). Besides, GPT2-XL (1.5B) has much smaller noise multipliers because we reduce the training batch size from 4096 to 256 to save computational resources. For other non-DP training hyperparameters, see Section 4.1.3. ... the standard dp_accounting package (DP Team, 2022). |
| Experiment Setup | Yes | For all settings involving DP finetuning, we use DP-Adam for 2000 steps with a batch size of 4096, a gradient norm clip of 1.0, and a weight decay of 0.1. The learning rate follows a linear decay schedule with 100 warmup steps, and the peak learning rate is selected from {1, 4} × {10⁻³, 10⁻⁴, 10⁻⁵} based on validation performance. The privacy budget accounts for both DP model finetuning and the collection of DP topic histogram statistics. We apply a Gaussian noise multiplier of 10 to the DP topic histogram. ... For the sample generation process, we generate 400K synthetic examples using nucleus sampling with top-p = 0.95 and a maximum sequence length of 512 tokens. ... For generative tasks, we train the causal versions of BERT-Mini and BERT-Small using a linear learning rate schedule from 0.0003 to 0, a batch size of 64, and a total of 6000 steps, with a weight decay of 0.01. For classification tasks, we finetune a RoBERTa-base model under the same hyperparameter settings as in the generative tasks above, except for a learning rate of 3 × 10⁻⁵. |
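The DP finetuning rows above mention DP-Adam with a gradient norm clip of 1.0 and a Gaussian noise multiplier. A minimal NumPy sketch of the per-example clip-and-noise step that produces the noisy gradient fed to the optimizer is shown below; `dp_adam_noisy_grad` is a hypothetical helper name, and the clip norm and noise multiplier values are the ones quoted from the paper, not an exact reproduction of its implementation.

```python
import numpy as np

def dp_adam_noisy_grad(per_example_grads, clip_norm=1.0,
                       noise_multiplier=1.0, rng=None):
    """Clip each example's gradient to clip_norm, sum, add Gaussian
    noise scaled by noise_multiplier * clip_norm, and average.
    This is the standard DP-SGD/DP-Adam gradient sanitization step."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down only when the per-example norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)
```

With `noise_multiplier=0` the function reduces to the mean of clipped gradients, which is a convenient sanity check.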
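The setup also allocates part of the privacy budget to a DP topic histogram with a Gaussian noise multiplier of 10. A sketch of that statistic, assuming each private document contributes one count to its topic bin (sensitivity 1) and that negative noisy counts are clipped before renormalization (a common post-processing choice, not confirmed by the paper):

```python
import numpy as np

def dp_topic_histogram(topic_ids, num_topics, noise_multiplier=10.0, rng=None):
    """Count documents per topic, add Gaussian noise with
    std = noise_multiplier (sensitivity 1 per document), then clip
    negatives and renormalize into a probability distribution."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    noisy = counts + rng.normal(0.0, noise_multiplier, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)  # post-processing preserves DP
    return noisy / noisy.sum()
```

The resulting distribution is what steers data synthesis toward the private domain's topic mix.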
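Finally, the 400K synthetic examples are generated with nucleus sampling at top-p = 0.95. A minimal sketch of the top-p filtering step over a next-token distribution (decoder loop and tokenizer omitted):

```python
import numpy as np

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize (nucleus sampling)."""
    order = np.argsort(probs)[::-1]          # tokens by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # include token crossing top_p
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    filtered = probs * mask
    return filtered / filtered.sum()
```

Sampling then draws the next token from the renormalized distribution, which truncates the low-probability tail while preserving diversity among likely tokens.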