STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Authors: Saksham Rastogi, Pratyush Maini, Danish Pruthi
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the effectiveness of our approach by continually pretraining the Pythia 1B model (Biderman et al., 2023) on deliberately contaminated pretraining data. We contaminate the pretraining corpus by injecting test examples from four different benchmarks. Even with minimal contamination (each test example appearing only once, and each benchmark comprising less than 0.001% of the total training data), our approach significantly outperforms existing methods, achieving statistically significant p-values across all contaminated benchmarks. We also conduct a false positive analysis, wherein we apply our detection methodology to off-the-shelf pretrained LLMs that have not been exposed to the watermarked benchmarks, and find that our test successfully denies their membership. |
| Researcher Affiliation | Collaboration | 1Indian Institute of Science 2Carnegie Mellon University 3Datology AI. Correspondence to: Saksham Rastogi <EMAIL>, Pratyush Maini <EMAIL>. |
| Pseudocode | No | The paper describes methods using natural language and mathematical equations (e.g., Equation 1 for modified logits, Equation 2 for perplexity difference, Equation 3 for t-test statistic, Equation 4 for multiple private keys), and includes diagrams (Figure 1), but does not feature any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | We make all our code, data and models available at github.com/codeboy5/STAMP |
| Open Datasets | Yes | We evaluate our approach using four widely-used benchmarks: Trivia QA (Joshi et al., 2017), ARC-C (Clark et al., 2018), MMLU (Hendrycks et al., 2021), and GSM8K (Cobbe et al., 2021). We contaminate the pretraining corpus by injecting test examples from these four benchmarks. The corpus is a combination of Open Web Text (Contributors, 2023) and public watermarked versions of the four benchmarks. To demonstrate STAMP's effectiveness in detecting unlicensed use of copyrighted data in model training, we present two expository case studies. Specifically, we apply STAMP to detect membership of (1) abstracts from EMNLP 2024 proceedings (emn, 2001) and (2) articles from the AI Snake Oil newsletter (Narayanan & Kapoor, 2023). |
| Dataset Splits | Yes | We sample 500 papers from EMNLP 2024 proceedings (emn, 2001) and generate watermarked rephrasings of their abstracts. Additionally, we generate watermarked rephrasings for another set of 500 abstracts, which we use as a held-out validation set for our experiments. We collect 56 posts from the popular AI Snake Oil newsletter (Narayanan & Kapoor, 2023), and use 44 for pretraining and hold 12 for validation. To analyze the effect of sample size (n) on detection power, we evaluate our test on benchmark subsets ranging from 100 to 1000 examples. We train a random forest classifier on the bag-of-words feature representations for the datasets. The classifier is trained on 80% of the member and non-member sets, with evaluation performed on the remaining 20%. |
| Hardware Specification | No | The paper states, "we perform continual pretraining on the 1 billion parameter Pythia model" and refers to "a modest 1B-parameter model," but it does not specify any hardware details like GPU or CPU models, memory, or specific computing platforms used for these experiments. |
| Software Dependencies | No | The paper mentions using specific models and methods (e.g., the Pythia 1B model and the KGW watermarking scheme), but it does not list concrete software dependencies, libraries, or version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | Setup. To simulate downstream benchmark contamination as it occurs in real-world scenarios and evaluate the effectiveness of our test, we perform continual pretraining on the 1 billion parameter Pythia model (Biderman et al., 2023) using an intentionally contaminated pretraining corpus. The corpus is a combination of Open Web Text (Contributors, 2023) and public watermarked versions of the four benchmarks, as mentioned in Section 4.1. Each test set accounts for less than 0.001% of the pretraining corpus, with exact sizes detailed in Table 6 in the appendix. All test sets in our experiments have a duplication rate of 1 (denoting no duplication whatsoever), and the overall pretraining dataset comprises 6.7 billion tokens. Details of the exact training hyperparameters are provided in Appendix E. Appendix E: We continually pretrain Pythia 1B on intentionally contaminated Open Web Text. Test case instances from the benchmark were randomly inserted between documents from Open Web Text. We trained for 1 epoch of 46,000 steps with an effective batch size of 144 sequences and a sequence length of 1024 tokens. We used the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 10^-4, (β1, β2) = (0.9, 0.999), and no weight decay. Appendix H: For watermarking, we use the KGW scheme (Kirchenbauer et al., 2024), with a context window of size 2, a split ratio (γ) of 0.5, and a boosting value (δ) of 2. |
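The KGW watermarking parameters reported above (context window of 2, split ratio γ = 0.5, boost δ = 2) can be illustrated with a minimal sketch of green-list logit boosting. This is an assumption-laden illustration, not the paper's implementation: the seeding of the green list from the 2-token context via SHA-256 is a hypothetical choice here (the actual KGW code may hash differently), and `green_list_mask` / `watermark_logits` are names introduced for this example.

```python
import hashlib

import numpy as np


def green_list_mask(context_ids, vocab_size, gamma=0.5):
    """Deterministically split the vocabulary into a gamma-fraction 'green' list,
    seeded from the last two context tokens (context window = 2)."""
    # Hypothetical seeding scheme: hash the 2-token context to a 32-bit seed.
    ctx = tuple(context_ids[-2:])
    seed = int(hashlib.sha256(repr(ctx).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(vocab_size)
    green = np.zeros(vocab_size, dtype=bool)
    green[perm[: int(gamma * vocab_size)]] = True
    return green


def watermark_logits(logits, context_ids, gamma=0.5, delta=2.0):
    """Add the boost delta to green-list token logits, leaving others unchanged
    (the modified-logits rule the paper's Equation 1 describes)."""
    logits = np.asarray(logits, dtype=float)
    green = green_list_mask(context_ids, len(logits), gamma)
    out = logits.copy()
    out[green] += delta
    return out
```

Because the green list is a deterministic function of the context, a detector holding the same key can recompute it for each position and test whether a suspect text (or a model's perplexity on it) over-represents green tokens relative to chance.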