Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion
Authors: Tianyuan Zou, Yang Liu, Peng Li, Yufei Xiong, Jianqing Zhang, Jingjing Liu, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://github.com/LindaLydia/WASP. |
| Researcher Affiliation | Collaboration | 1) Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; 2) The Hong Kong Polytechnic University, Hong Kong, China; 3) Shanghai Artificial Intelligence Laboratory, Shanghai, China; 4) Department of Mathematics, Harbin Institute of Technology, Weihai, Shandong, China; 5) Shanghai Jiao Tong University, Shanghai, China; 6) AsiaInfo Technologies, Shanghai, China. Correspondence to: Yang Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 WASP. Input: K PLMs {P_k}_{k=1}^{K} with empty synthetic datasets {D_k}_{k=1}^{K}; 1 data party with private dataset B of size M belonging to C categories; number of in-context samples S; number of iterations T taken to obtain in total N synthetic samples; initialized PLM weights {w_k = 1/K}_{k=1}^{K}; learning rate η; DP privacy parameters ϵ, δ, δ_iter; test dataset A; randomly initialized STM m^{(0)}. Output: STM m. ... Algorithm 2 WASP for Distributed Federated Data (L > 1) ... Algorithm 3 Functions used in Algorithms 1 and 2 for WASP |
| Open Source Code | Yes | Code is available at https://github.com/LindaLydia/WASP. |
| Open Datasets | Yes | We evaluate on 6 widely used tasks: 1) IMDb (Maas et al., 2011) (2 categories) for the movie-review sentiment analysis task; 2) Yelp-Category (Yelp Inc., 2015) (10 categories) for the business-review item field classification task; 3) Yelp-Rating (Yelp Inc., 2015) (5 categories) for the business-review rating classification task; 4) Openreview-Category (Xie et al., 2024) (12 categories) for the paper-review classification by research area task; 5) Openreview-Rating (Xie et al., 2024) (5 categories) for the paper-review classification by review rating task; and 6) Banking (10 categories selected from Banking77 (Casanueva et al., 2020)) for the online-banking queries field classification task. |
| Dataset Splits | Yes | By default, we use 100 private samples (M = 100) for the main experiments. For the federated-data (L > 1) scenario, we use L = 10 private data parties that together control 300 private samples (M = Σ_{l=1}^{10} \|B_l\| = 300). To better align with real-world scenarios, each participating data party controls a private dataset that is non-i.i.d. with respect to the others, and these datasets aggregate into an unbalanced whole. We follow Dirichlet partitioning (Yurochkin et al., 2019; Hsu et al., 2019; Zhang et al., 2023) to distribute private samples to each party with parameter α = 1.0. For the DP synthetic dataset, we generate a total of 6,000 samples from all participating PLMs within 5 iterations. ... B is randomly drawn from the training sets of these datasets, with their test sets used to evaluate the trained STM. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions specific pre-trained models such as GPT-2, Llama-2, Vicuna, OPT, ChatGLM3, Flan-T5, GPT-3.5, GPT-4, GPT-4o, BERT, and sentence-t5-base, but does not provide specific version numbers for the software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | By default, we use 100 private samples (M = 100) for the main experiments. For the federated-data (L > 1) scenario, we use L = 10 private data parties that together control 300 private samples (M = Σ_{l=1}^{10} \|B_l\| = 300). ... For the DP synthetic dataset, we generate a total of 6,000 samples from all participating PLMs within 5 iterations. Since the first iteration does not use private-sample information for feedback, only the last 4 iterations are privacy-sensitive. By default, we use δ_iter = 1×10^{-5} in our experiments and list only ϵ alongside the results. The notion of DP is sample-level DP unless otherwise stated. |
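The Dirichlet partition cited in the Dataset Splits row (α = 1.0, L = 10 parties, M = 300) can be sketched as below. This is a minimal illustration of the standard per-class Dirichlet scheme (Hsu et al., 2019), not code from the WASP repository; the function name `dirichlet_partition` and its arguments are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_parties=10, alpha=1.0, seed=0):
    """Split sample indices across parties using a per-class Dirichlet
    prior, yielding non-i.i.d., unbalanced local datasets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    party_indices = [[] for _ in range(num_parties)]
    for c in np.unique(labels):
        # Shuffle the indices of class c, then draw this class's
        # per-party proportions from Dirichlet(alpha, ..., alpha).
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_parties))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for party, chunk in enumerate(np.split(idx, cuts)):
            party_indices[party].extend(chunk.tolist())
    return [np.array(p, dtype=int) for p in party_indices]

# Example matching the paper's federated setting: M = 300 private
# samples over C = 2 categories, split across L = 10 parties.
labels = np.repeat([0, 1], 150)
parts = dirichlet_partition(labels, num_parties=10, alpha=1.0)
```

Smaller α makes the local label distributions more skewed; α = 1.0 (the paper's setting) gives moderately heterogeneous parties while every sample is assigned to exactly one party.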