Cape: Context-Aware Prompt Perturbation Mechanism with Differential Privacy

Authors: Haoqi Wu, Wei Dai, Li Wang, Qiang Yan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across multiple datasets, along with ablation studies, demonstrate that Cape achieves a better privacy-utility tradeoff compared to prior state-of-the-art works.
Researcher Affiliation | Industry | Haoqi Wu, Wei Dai, Li Wang, Qiang Yan (TikTok). Correspondence to: Haoqi Wu <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Equal-Width Bucketing, Bucket): Input: utility score vector u, number of buckets Nb. Output: a set of buckets B with different tokens. ... Algorithm 2 (Cape Mechanism, R): Input: prompt x = {t1, t2, ..., tn}, device model M, embedding table e, importance factors λL, λD, bucket number Nb, and budget ε > 0. Output: perturbed prompt x̂ = R(x).
Open Source Code | No | The paper does not provide a direct link to a code repository or an explicit statement of code release for the methodology described in this paper.
Open Datasets | Yes | For the former, we follow prior works (Yue et al., 2021; Chen et al., 2022) to use two GLUE datasets with privacy implications. 1) SST-2: This contains sentiment annotations for movie reviews...; 2) QNLI: This is a dataset containing sentence pairs for binary classification... For the latter, we follow (Tong et al., 2023) to use Wikitext-103-v1, a large-scale dataset derived from Wikipedia articles for language modeling tasks.
Dataset Splits | Yes | For the former, we follow prior works (Yue et al., 2021; Chen et al., 2022) to use two GLUE datasets with privacy implications. 1) SST-2: This contains sentiment annotations for movie reviews, which is used to perform sentiment classification (positive or negative); 2) QNLI: This is a dataset containing sentence pairs for binary classification (entailment/not entailment). We use accuracy as the metric. ... For the latter, we follow (Tong et al., 2023) to use Wikitext-103-v1, a large-scale dataset derived from Wikipedia articles for language modeling tasks. ... on the validation set of SST-2 and QNLI datasets.
Hardware Specification | Yes | All the experiments are carried out on one Debian 11 machine equipped with one Intel Xeon Platinum 8260 CPU (6 cores, 2.40 GHz), 16 GB of RAM, and 4 Nvidia Tesla V100-SXM2-32GB GPUs.
Software Dependencies | No | The paper mentions 'one Debian 11 machine' as the operating system but does not provide specific version numbers for other software dependencies such as programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | By default, we set λL = 0.5, λD = 1.0 and Nb = 50. We run inference on the original data as non-private baseline. ... (default temperature of 0.5).
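The pseudocode and setup rows above name the key ingredients: equal-width bucketing of utility scores into Nb = 50 buckets (Algorithm 1) and a perturbation mechanism with privacy budget ε (Algorithm 2). The sketch below illustrates what such a pipeline could look like; the function names, the degenerate-case handling, and the use of the standard exponential mechanism for bucket selection are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def equal_width_bucket(u, n_buckets=50):
    """Equal-width bucketing (sketch of Algorithm 1, details assumed).

    Splits the utility range [min(u), max(u)] into n_buckets intervals
    of equal width and returns, for each bucket, the indices of the
    tokens whose scores fall inside it.
    """
    u = np.asarray(u, dtype=float)
    lo, hi = u.min(), u.max()
    if hi == lo:  # degenerate case: all scores identical -> one bucket
        return [list(range(len(u)))] + [[] for _ in range(n_buckets - 1)]
    width = (hi - lo) / n_buckets
    # Clip so the maximum score lands in the last bucket rather than
    # an out-of-range index.
    idx = np.clip(((u - lo) / width).astype(int), 0, n_buckets - 1)
    return [np.flatnonzero(idx == b).tolist() for b in range(n_buckets)]

def sample_bucket_dp(bucket_utils, epsilon, sensitivity=1.0, rng=None):
    """Pick a bucket index via the standard exponential mechanism.

    Bucket b is chosen with probability proportional to
    exp(epsilon * u_b / (2 * sensitivity)), so higher-utility buckets
    are favoured while the choice stays epsilon-DP with respect to the
    utility scores. The paper's exact sampling rule may differ.
    """
    rng = rng or np.random.default_rng()
    u = np.asarray(bucket_utils, dtype=float)
    logits = epsilon * u / (2.0 * sensitivity)
    logits -= logits.max()  # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(u), p=p))
```

With the paper's default Nb = 50, a perturbed replacement for each token could then be drawn from the selected bucket; how Cape combines this with the importance factors λL and λD is not specified in the quoted excerpts.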