Private Federated Learning using Preference-Optimized Synthetic Data
Authors: Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri closes the gap in performance between the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri. Section 5. Experiments |
| Researcher Affiliation | Collaboration | (1) Department of ECE, Carnegie Mellon University, Pittsburgh, PA; (2) Pittsburgh Supercomputing Center, Pittsburgh, USA; (3) Coldrays, Tucson, AZ. Correspondence to: Charlie Hou <EMAIL>. |
| Pseudocode | Yes | Pseudocode can be found in Algorithm 1. We highlight the algorithmically new steps (those that differ from PE) in blue. Further pseudocode: Algorithm 2, POPri (central DP, unconditional); Algorithm 3, POPri (central DP, conditional); Algorithm 4, SIMILARITY; Algorithm 5, CENTRALSCORE. |
| Open Source Code | Yes | The code and data are available at https://github.com/meiyuw/POPri. |
| Open Datasets | Yes | To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. The code and data are available at https://github.com/meiyuw/POPri. Congressional Speeches (Congress) is a dataset of 134k speeches or debates scraped from congressional or parliamentary transcripts in the US, UK, and Canada (https://huggingface.co/datasets/hazylavender/CongressionalDataset). bioRxiv is a dataset of 57k abstracts (https://huggingface.co/datasets/hazylavender/biorxiv-abstract). We also use PubMed (Yu et al., 2023; Xie et al., 2024), used in the evaluation of Private Evolution (Aug-PE) (Xie et al., 2024). ... For text classification, we evaluate POPri on OpenReview, consisting of ICLR 2023 reviews published on November 5, 2022, which was used in the evaluation of Private Evolution (Aug-PE) (Xie et al., 2024). |
| Dataset Splits | Yes | Table 4 (dataset details) — bioRxiv: 72,000 train / 2,000 validation / 1,584 test samples, max sequence length 64, avg. # of samples per client 6.6 (2.6). Congressional Speeches: 133,000 / 4,200 / 1,547, max sequence length 64, avg. # of samples per client 5.0 (16.3). PubMed: 75,316 / 14,423 / 4,453, max sequence length 512, avg. # of samples per client 1. |
| Hardware Specification | Yes | This work used Bridges-2 GPU (Brown et al., 2021; Buitrago & Nystrom, 2021) at the Pittsburgh Supercomputing Center through allocation CIS240135 and CIS240937 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296 (Boerner et al., 2023). The authors acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot, the AI and Big Data group at the Pittsburgh Supercomputing Center, and NCSA Delta GPU for contributing to this research result. |
| Software Dependencies | No | The paper mentions models such as LLaMA-3-8B, DistilGPT2, BERT-small, the all-MiniLM-L6-v2 and sentence-t5-base sentence transformers, and RoBERTa, as well as techniques such as LoRA fine-tuning and the AdamW optimizer. It also refers to the Opacus library for privacy accounting. However, no specific version numbers are provided for any programming languages, libraries (including Opacus), or other key software components used in the experiments. |
| Experiment Setup | Yes | To fine-tune the LLaMA-3-8B model, we use LoRA fine-tuning with rank 4, α = 8, applied to all the projection matrices in LLaMA-3-8B. We adopt the AdamW optimizer with a cosine learning rate scheduler, with the learning rate ranging from 3×10⁻⁷ to 8×10⁻⁷. ... For each iteration, we fine-tune the models for 2 epochs and select the best checkpoint with the lowest FID score relative to the validation dataset. This checkpoint is used for synthetic data generation and as the starting point for the next iteration. The batch size is set to 24. ... For bioRxiv & Congressional Speeches, we fine-tuned the pre-trained DistilGPT2 for next-token prediction. We set the max sequence length as 64, the number of generated synthetic data as 1,000,000, the batch size as 160, the learning rate as 2e-4, and the number of epochs as 80. ... For PubMed, ... We set the max sequence length as 512, the number of generated synthetic data as 2000, the batch size as 32, the learning rate as 3e-4, the weight decay as 0.01, and the number of epochs as 10. |
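The LoRA and optimizer settings quoted in the Experiment Setup row can be summarized in a short sketch. This is an illustrative reconstruction, not the authors' released code: only the numerical hyperparameters (rank 4, α = 8, learning rate range 3×10⁻⁷ to 8×10⁻⁷) come from the paper, while the `cosine_lr` helper and all variable names are our own assumptions about how a standard cosine schedule over those bounds would be implemented.

```python
import math

# Hyperparameters quoted from the paper's setup: LoRA rank 4, alpha = 8,
# AdamW with a cosine learning-rate schedule between 3e-7 and 8e-7.
LORA_RANK = 4
LORA_ALPHA = 8
LR_MIN, LR_MAX = 3e-7, 8e-7

def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine-annealed learning rate decaying from LR_MAX to LR_MIN.

    Illustrative reconstruction: the paper states only the range and the
    scheduler type, not the exact annealing formula or step count.
    """
    progress = step / max(total_steps - 1, 1)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))

# In standard LoRA, the low-rank update BA is scaled by alpha / rank,
# so rank 4 with alpha = 8 gives a scaling factor of 2.0.
lora_scaling = LORA_ALPHA / LORA_RANK
```

With these values, the schedule starts at the quoted maximum (`cosine_lr(0, T) == 8e-7`) and decays to the quoted minimum at the final step; the α/r = 2 scaling is the conventional LoRA convention and would correspond to a `peft.LoraConfig(r=4, lora_alpha=8)` if the authors used the Hugging Face PEFT library, which the paper does not state.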