Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs

Authors: Xun Wang, Jing Xu, Franziska Boenisch, Michael Backes, Christopher A. Choquette-Choo, Adam Dziedzic

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our thorough experimental evaluation on both masked language models and auto-regressive language models demonstrates that our method can efficiently, effectively, and privately transfer soft prompts with high utility. Our code is available at https://github.com/sprintml/POST.
Researcher Affiliation Collaboration 1CISPA Helmholtz Center for Information Security, Saarbrücken, Germany 2Google DeepMind. Correspondence to: Franziska Boenisch <EMAIL>, Adam Dziedzic <EMAIL>.
Pseudocode No The paper describes the methodology in prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/sprintml/POST.
Open Datasets Yes Following prior work (Hong et al., 2023; Wu et al., 2023), we evaluate the performance of our proposed method on five classification-task datasets: sst2 from the GLUE benchmark (Wang et al., 2019), imdb (Maas et al., 2011), tweet (Rosenthal et al., 2017), arisetv (Okite, 2022) and mpqa (Wiebe et al., 2005). ... As public data, we also include agnews (Zhang et al., 2015) and boolq (Clark et al., 2019) for the classification task, while for the generation task, we use AIE (Kudari, 2022). We also include disaster (Crowd Flower, 2019) and trec (Li & Roth, 2002) for baseline comparison and ablation on the choice of public data.
Dataset Splits Yes To evaluate the success of our method, we report the accuracy on the test data split of our private datasets for the teacher LLM with the transferred prompt (Private Transfer). ... Datasets. Following prior work (Hong et al., 2023; Wu et al., 2023), we evaluate the performance of our proposed method on five classification-task datasets: sst2 from the GLUE benchmark (Wang et al., 2019), imdb (Maas et al., 2011), tweet (Rosenthal et al., 2017), arisetv (Okite, 2022) and mpqa (Wiebe et al., 2005).
Hardware Specification Yes All experiments are executed on a single A100 GPU.
Software Dependencies No The paper does not explicitly list specific software dependencies with version numbers, such as 'Python 3.x' or 'PyTorch X.Y'.
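Since the paper pins no versions, a purely illustrative environment sketch follows. Every package name and version bound here is an assumption inferred from the described setup (PyTorch-based prompt tuning on Hugging Face models with DP training), not taken from the paper or its repository:

```shell
# Illustrative only -- the paper does not specify dependency versions.
pip install "torch>=2.0" "transformers>=4.30" "datasets" "opacus"
```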
Experiment Setup Yes We experiment with various degrees of compression in the KD (see Appendix D.4). For the results presented in the main body of the paper, we compress the 12-layer Roberta-base and the 32-layer Llama2-7b to 2 layers, and the 48-layer GPT2-XL to 4 layers. We use the Bookcorpus (Zhu et al., 2015) dataset for the KD. ... Following Su et al. (2022), we initialize our soft prompts with 100 tokens. The hyperparameters for prompt tuning per dataset, including the δ for the DP setup, are presented in Table 10. ... By default, we use 5000 steps for Roberta-base, 8000 steps for GPT2-XL, and 6000 steps for Llama2-7b. ... Table 8: Hyperparameters in Knowledge Distillation. ... Table 11: Hyperparameters used during Prompt Transfer. ... Table 12: Setting of α for Different Datasets and Models during Prompt Transfer.
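The two structural ingredients quoted above (a 100-token soft prompt and a student built by truncating the teacher's layer stack, e.g. 12-layer Roberta-base to 2 layers) can be sketched as below. This is a hedged illustration, not the paper's implementation: the embedding dimension, initialization scale, and the choice to keep the *first* layers are all assumptions.

```python
import copy
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable soft prompt of `n_tokens` embeddings prepended to the
    input embeddings (the paper initializes with 100 tokens)."""
    def __init__(self, n_tokens=100, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq, dim)
        batch = input_embeds.shape[0]
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, input_embeds], dim=1)

def truncate_layers(teacher_layers, keep=2):
    """Build a small student by copying `keep` layers of the teacher stack,
    mimicking the 12-layer -> 2-layer compression described for Roberta-base.
    (Which layers are kept is an assumption here.)"""
    return nn.ModuleList(copy.deepcopy(teacher_layers[i]) for i in range(keep))

# Usage with toy dimensions:
sp = SoftPrompt(n_tokens=100, dim=16)
out = sp(torch.zeros(2, 5, 16))          # (2, 100 + 5, 16)
teacher_stack = nn.ModuleList(nn.Linear(16, 16) for _ in range(12))
student_stack = truncate_layers(teacher_stack, keep=2)
```

After distillation on Bookcorpus, the prompt would be tuned on the compressed student (with DP-SGD in the private setting) and then transferred to the teacher.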