Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs

Authors: Xun Wang, Jing Xu, Franziska Boenisch, Michael Backes, Christopher A. Choquette-Choo, Adam Dziedzic

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our thorough experimental evaluation on both masked language models and auto-regressive language models demonstrates that our method can efficiently, effectively, and privately transfer soft prompts with high utility. Our code is available at https://github.com/sprintml/POST.
Researcher Affiliation Collaboration 1CISPA Helmholtz Center for Information Security, Saarbrücken, Germany 2Google DeepMind. Correspondence to: Franziska Boenisch <EMAIL>, Adam Dziedzic <EMAIL>.
Pseudocode No The paper describes the methodology in prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/sprintml/POST.
Open Datasets Yes Following prior work (Hong et al., 2023; Wu et al., 2023), we evaluate the performance of our proposed method on five classification-task datasets: sst2 from the GLUE benchmark (Wang et al., 2019), imdb (Maas et al., 2011), tweet (Rosenthal et al., 2017), arisetv (Okite, 2022) and mpqa (Wiebe et al., 2005). ... As public data, we also include agnews (Zhang et al., 2015) and boolq (Clark et al., 2019) for the classification task, while for the generation task, we use AIE (Kudari, 2022). We also include disaster (Crowd Flower, 2019) and trec (Li & Roth, 2002) for baseline comparison and ablation on the choice of public data.
Dataset Splits Yes To evaluate the success of our method, we report the accuracy on the test data split of our private datasets for the teacher LLM with the transferred prompt (Private Transfer). ... Datasets. Following prior work (Hong et al., 2023; Wu et al., 2023), we evaluate the performance of our proposed method on five classification-task datasets: sst2 from the GLUE benchmark (Wang et al., 2019), imdb (Maas et al., 2011), tweet (Rosenthal et al., 2017), arisetv (Okite, 2022) and mpqa (Wiebe et al., 2005).
Hardware Specification Yes All experiments are executed on a single A100 GPU.
Software Dependencies No The paper does not explicitly list specific software dependencies with version numbers, such as 'Python 3.x' or 'PyTorch X.Y'.
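Since the paper pins no versions, a purely illustrative environment sketch follows. Every package name and version bound here is an assumption inferred from the described setup (PyTorch-based prompt tuning on Hugging Face models with DP training), not taken from the paper or its repository:

```shell
# Illustrative only -- the paper does not specify dependency versions.
pip install "torch>=2.0" "transformers>=4.30" "datasets" "opacus"
```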
Experiment Setup Yes We experiment with various degrees of compression in the KD (see Appendix D.4). For the results presented in the main body of the paper, we compress the 12-layer Roberta-base and the 32-layer Llama2-7b to 2 layers, and the 48-layer GPT2-XL to 4 layers. We use the Bookcorpus (Zhu et al., 2015) dataset for the KD. ... Following Su et al. (2022), we initialize our soft prompts with 100 tokens. The hyperparameters for prompt tuning per dataset, including the δ for the DP setup, are presented in Table 10. ... By default, we use 5000 steps for Roberta-base, 8000 steps for GPT2-XL, and 6000 steps for Llama2-7b. ... Table 8: Hyperparameters in Knowledge Distillation. ... Table 11: Hyperparameters used during Prompt Transfer. ... Table 12: Setting of α for Different Datasets and Models during Prompt Transfer.
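The two structural ingredients quoted above (a 100-token soft prompt and a student built by truncating the teacher's layer stack, e.g. 12-layer Roberta-base to 2 layers) can be sketched as below. This is a hedged illustration, not the paper's implementation: the embedding dimension, initialization scale, and the choice to keep the *first* layers are all assumptions.

```python
import copy
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable soft prompt of `n_tokens` embeddings prepended to the
    input embeddings (the paper initializes with 100 tokens)."""
    def __init__(self, n_tokens=100, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq, dim)
        batch = input_embeds.shape[0]
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, input_embeds], dim=1)

def truncate_layers(teacher_layers, keep=2):
    """Build a small student by copying `keep` layers of the teacher stack,
    mimicking the 12-layer -> 2-layer compression described for Roberta-base.
    (Which layers are kept is an assumption here.)"""
    return nn.ModuleList(copy.deepcopy(teacher_layers[i]) for i in range(keep))

# Usage with toy dimensions:
sp = SoftPrompt(n_tokens=100, dim=16)
out = sp(torch.zeros(2, 5, 16))          # (2, 100 + 5, 16)
teacher_stack = nn.ModuleList(nn.Linear(16, 16) for _ in range(12))
student_stack = truncate_layers(teacher_stack, keep=2)
```

After distillation on Bookcorpus, the prompt would be tuned on the compressed student (with DP-SGD in the private setting) and then transferred to the teacher.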