Private Federated Learning using Preference-Optimized Synthetic Data
Authors: Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri closes the gap in performance between the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri. Section 5. Experiments |
| Researcher Affiliation | Collaboration | (1) Department of ECE, Carnegie Mellon University, Pittsburgh, PA; (2) Pittsburgh Supercomputing Center, Pittsburgh, USA; (3) Coldrays, Tucson, AZ. Correspondence to: Charlie Hou <EMAIL>. |
| Pseudocode | Yes | Pseudocode can be found in Algorithm 1. We highlight the algorithmically new steps (those that differ from PE) in blue. Further pseudocode: Algorithm 2, POPri (central DP, unconditional); Algorithm 3, POPri (central DP, conditional); Algorithm 4, SIMILARITY; Algorithm 5, CENTRALSCORE. |
| Open Source Code | Yes | The code and data are available at https://github.com/meiyuw/POPri. |
| Open Datasets | Yes | To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. The code and data are available at https://github.com/meiyuw/POPri. Congressional Speeches (Congress) is a dataset of 134k speeches or debates scraped from congressional or parliamentary transcripts in the US, UK, and Canada (https://huggingface.co/datasets/hazylavender/CongressionalDataset). bioRxiv is a dataset of 57k abstracts (https://huggingface.co/datasets/hazylavender/biorxiv-abstract). We also use PubMed (Yu et al., 2023; Xie et al., 2024), used in the evaluation of Private Evolution (Aug-PE) (Xie et al., 2024). ... For text classification, we evaluate POPri on OpenReview, consisting of ICLR 2023 reviews published on November 5, 2022, which was used in the evaluation of Private Evolution (Aug-PE) (Xie et al., 2024). |
| Dataset Splits | Yes | Table 4 (dataset details) — bioRxiv: 72,000 train / 2,000 validation / 1,584 test samples, max sequence length 64, avg. # of samples per client 6.6 (2.6). Congressional Speeches: 133,000 / 4,200 / 1,547, max sequence length 64, avg. # of samples per client 5.0 (16.3). PubMed: 75,316 / 14,423 / 4,453, max sequence length 512, avg. # of samples per client 1. |
| Hardware Specification | Yes | This work used Bridges-2 GPU (Brown et al., 2021; Buitrago & Nystrom, 2021) at the Pittsburgh Supercomputing Center through allocation CIS240135 and CIS240937 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296 (Boerner et al., 2023). The authors acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot, the AI and Big Data group at the Pittsburgh Supercomputing Center, and NCSA Delta GPU for contributing to this research result. |
| Software Dependencies | No | The paper mentions models such as LLaMA-3-8B, DistilGPT2, BERT-small, the all-MiniLM-L6-v2 and sentence-t5-base sentence transformers, and RoBERTa, as well as techniques such as LoRA fine-tuning and the AdamW optimizer. It also refers to the Opacus library for privacy accounting. However, no specific version numbers are provided for any programming languages, libraries (including Opacus), or other key software components used in the experiments. |
| Experiment Setup | Yes | To fine-tune the LLaMA-3-8B model, we use LoRA fine-tuning with rank 4, α = 8, applied to all the projection matrices in LLaMA-3-8B. We adopt the AdamW optimizer with a cosine learning rate scheduler, with the learning rate ranging from 3×10⁻⁷ to 8×10⁻⁷. ... For each iteration, we fine-tune the models for 2 epochs and select the best checkpoint with the lowest FID score relative to the validation dataset. This checkpoint is used for synthetic data generation and as the starting point for the next iteration. The batch size is set to 24. ... For bioRxiv & Congressional Speeches, we fine-tuned the pre-trained DistilGPT2 for next-token prediction. We set the max sequence length as 64, the number of generated synthetic data as 1,000,000, the batch size as 160, the learning rate as 2e-4, and the number of epochs as 80. ... For PubMed, ... We set the max sequence length as 512, the number of generated synthetic data as 2000, the batch size as 32, the learning rate as 3e-4, the weight decay as 0.01, and the number of epochs as 10. |
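The LoRA and optimizer settings quoted in the Experiment Setup row can be summarized in a short sketch. This is an illustrative reconstruction, not the authors' released code: only the numerical hyperparameters (rank 4, α = 8, learning rate range 3×10⁻⁷ to 8×10⁻⁷) come from the paper, while the `cosine_lr` helper and all variable names are our own assumptions about how a standard cosine schedule over those bounds would be implemented.

```python
import math

# Hyperparameters quoted from the paper's setup: LoRA rank 4, alpha = 8,
# AdamW with a cosine learning-rate schedule between 3e-7 and 8e-7.
LORA_RANK = 4
LORA_ALPHA = 8
LR_MIN, LR_MAX = 3e-7, 8e-7

def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine-annealed learning rate decaying from LR_MAX to LR_MIN.

    Illustrative reconstruction: the paper states only the range and the
    scheduler type, not the exact annealing formula or step count.
    """
    progress = step / max(total_steps - 1, 1)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))

# In standard LoRA, the low-rank update BA is scaled by alpha / rank,
# so rank 4 with alpha = 8 gives a scaling factor of 2.0.
lora_scaling = LORA_ALPHA / LORA_RANK
```

With these values, the schedule starts at the quoted maximum (`cosine_lr(0, T) == 8e-7`) and decays to the quoted minimum at the final step; the α/r = 2 scaling is the conventional LoRA convention and would correspond to a `peft.LoraConfig(r=4, lora_alpha=8)` if the authors used the Hugging Face PEFT library, which the paper does not state.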