STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Authors: Saksham Rastogi, Pratyush Maini, Danish Pruthi
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the effectiveness of our approach by continually pretraining the Pythia 1B model (Biderman et al., 2023) on deliberately contaminated pretraining data. We contaminate the pretraining corpus by injecting test examples from four different benchmarks. Even with minimal contamination (each test example appearing only once, and each benchmark comprising less than 0.001% of the total training data), our approach significantly outperforms existing methods, achieving statistically significant p-values across all contaminated benchmarks. We also conduct a false positive analysis, wherein we apply our detection methodology to off-the-shelf pretrained LLMs that have not been exposed to the watermarked benchmarks, and find that our test successfully denies their membership. |
| Researcher Affiliation | Collaboration | 1Indian Institute of Science 2Carnegie Mellon University 3Datology AI. Correspondence to: Saksham Rastogi <EMAIL>, Pratyush Maini <EMAIL>. |
| Pseudocode | No | The paper describes methods using natural language and mathematical equations (e.g., Equation 1 for modified logits, Equation 2 for perplexity difference, Equation 3 for t-test statistic, Equation 4 for multiple private keys), and includes diagrams (Figure 1), but does not feature any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | We make all our code, data and models available at github.com/codeboy5/STAMP |
| Open Datasets | Yes | We evaluate our approach using four widely-used benchmarks: Trivia QA (Joshi et al., 2017), ARC-C (Clark et al., 2018), MMLU (Hendrycks et al., 2021), and GSM8K (Cobbe et al., 2021). We contaminate the pretraining corpus by injecting test examples from these four benchmarks. The corpus is a combination of Open Web Text (Contributors, 2023) and public watermarked versions of the four benchmarks. To demonstrate STAMP's effectiveness in detecting unlicensed use of copyrighted data in model training, we present two expository case studies. Specifically, we apply STAMP to detect membership of (1) abstracts from EMNLP 2024 proceedings (emn, 2001) and (2) articles from the AI Snake Oil newsletter (Narayanan & Kapoor, 2023). |
| Dataset Splits | Yes | We sample 500 papers from EMNLP 2024 proceedings (emn, 2001) and generate watermarked rephrasings of their abstracts. Additionally, we generate watermarked rephrasings for another set of 500 abstracts, which we use as a held-out validation set for our experiments. We collect 56 posts from the popular AI Snake Oil newsletter (Narayanan & Kapoor, 2023), and use 44 for pretraining and hold 12 for validation. To analyze the effect of sample size (n) on detection power, we evaluate our test on benchmark subsets ranging from 100 to 1000 examples. We train a random forest classifier on the bag-of-words feature representations for the datasets. The classifier is trained on 80% of the member and non-member sets, with evaluation performed on the remaining 20%. |
| Hardware Specification | No | The paper states, "we perform continual pretraining on the 1 billion parameter Pythia model" and refers to "a modest 1B-parameter model," but it does not specify any hardware details like GPU or CPU models, memory, or specific computing platforms used for these experiments. |
| Software Dependencies | No | The paper mentions using specific models and methods (e.g., the Pythia 1B model and the KGW watermarking scheme), but it does not list concrete software dependencies, libraries, or version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | Setup. To simulate downstream benchmark contamination as it occurs in real-world scenarios and evaluate the effectiveness of our test, we perform continual pretraining on the 1 billion parameter Pythia model (Biderman et al., 2023) using an intentionally contaminated pretraining corpus. The corpus is a combination of Open Web Text (Contributors, 2023) and public watermarked versions of the four benchmarks, as mentioned in Section 4.1. Each test set accounts for less than 0.001% of the pretraining corpus, with exact sizes detailed in Table 6 in the appendix. All test sets in our experiments have a duplication rate of 1 (denoting no duplication whatsoever), and the overall pretraining dataset comprises 6.7 billion tokens. Details of the exact training hyperparameters are provided in Appendix E. Appendix E: We continually pretrain Pythia 1B on intentionally contaminated Open Web Text. Test case instances from the benchmark were randomly inserted between documents from Open Web Text. We trained for 1 epoch of 46,000 steps with an effective batch size of 144 sequences and a sequence length of 1024 tokens. We used the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 10^-4, (β1, β2) = (0.9, 0.999), and no weight decay. Appendix H: For watermarking, we use the KGW scheme (Kirchenbauer et al., 2024), with a context window of size 2, a split ratio (γ) of 0.5, and a boosting value (δ) of 2. |
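The KGW watermarking parameters reported above (context window of 2, split ratio γ = 0.5, boost δ = 2) can be illustrated with a minimal sketch of green-list logit boosting. This is an assumption-laden illustration, not the paper's implementation: the seeding of the green list from the 2-token context via SHA-256 is a hypothetical choice here (the actual KGW code may hash differently), and `green_list_mask` / `watermark_logits` are names introduced for this example.

```python
import hashlib

import numpy as np


def green_list_mask(context_ids, vocab_size, gamma=0.5):
    """Deterministically split the vocabulary into a gamma-fraction 'green' list,
    seeded from the last two context tokens (context window = 2)."""
    # Hypothetical seeding scheme: hash the 2-token context to a 32-bit seed.
    ctx = tuple(context_ids[-2:])
    seed = int(hashlib.sha256(repr(ctx).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(vocab_size)
    green = np.zeros(vocab_size, dtype=bool)
    green[perm[: int(gamma * vocab_size)]] = True
    return green


def watermark_logits(logits, context_ids, gamma=0.5, delta=2.0):
    """Add the boost delta to green-list token logits, leaving others unchanged
    (the modified-logits rule the paper's Equation 1 describes)."""
    logits = np.asarray(logits, dtype=float)
    green = green_list_mask(context_ids, len(logits), gamma)
    out = logits.copy()
    out[green] += delta
    return out
```

Because the green list is a deterministic function of the context, a detector holding the same key can recompute it for each position and test whether a suspect text (or a model's perplexity on it) over-represents green tokens relative to chance.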