reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Black-Box Detection of Language Model Watermarks

Authors: Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We experimentally confirm the effectiveness of our methods on a range of schemes and a diverse set of open-source models. Further, we validate the feasibility of our tests on real-world APIs.
Researcher Affiliation	Academia	Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev ETH Zurich EMAIL, EMAIL
Pseudocode	Yes	We present an additional algorithmic description of the Red-Green test (§2) in Algorithm 2, the Fixed-Sampling test (§3) in Algorithm 4 and the Cache-Augmented test (§4) in Algorithm 5.
Open Source Code	Yes	Our code is publicly available at https://github.com/eth-sri/watermark-detection.
Open Datasets	No	To generate the samples for the original watermark detector, following the method in (Kirchenbauer et al., 2023), we generate 100 completions of 200 tokens, using prompts sampled from C4.
Dataset Splits	No	The paper describes experimental setups with specific query numbers and repetitions (e.g., "N1 = 10, N2 = 9, r = 1.96", "n = 1000 queries", "Q1 = Q2 = 75"), but it does not specify training/validation/test splits for any dataset. The focus is on statistical testing of generated text, not on traditional model training or evaluation splits.
Hardware Specification	No	The paper does not specify the hardware (e.g., GPU models, CPU types, or memory) used to run its experiments. It mentions testing on black-box LLM deployments (GPT4, Claude 3, Gemini 1.0 Pro), but these are the target systems, not the hardware used by the authors for their experimental setup.
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies, libraries, frameworks, or operating systems used in its methodology or experimental setup. It only discusses the conceptual aspects of watermark detection and the models under evaluation.
Experiment Setup	Yes	For Red-Green tests, we set N1 = 10, N2 = 9, r = 1.96, a different Σ per model based on the first Q1 samples, use 100 samples to estimate the probabilities, and use 10000 permutations in the test. ... For Fixed-Sampling tests, we use n = 1000 queries and set t = 50. For Cache-Augmented tests, we use Q1 = Q2 = 75 and assume the cache is cleared between queries in the second phase.
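
The Experiment Setup row quotes a permutation count (10000) alongside several other hyperparameters. As a rough illustration only, the sketch below collects a subset of those quoted values in a small config object and shows how a permutation count is typically used to estimate a p-value. The class name `ReportedSetup`, its field names, and the function `permutation_p_value` are hypothetical. The test statistic (absolute difference in means) is a generic stand-in and is not the statistic used by the paper's Red-Green test; the authors' actual implementation is in their repository.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    red_green_r: float = 1.96
    red_green_prob_samples: int = 100      # samples used to estimate probabilities
    red_green_permutations: int = 10_000   # permutations in the test
    fixed_sampling_queries: int = 1000     # n
    fixed_sampling_t: int = 50             # t
    cache_q1: int = 75                     # Q1
    cache_q2: int = 75                     # Q2


def permutation_p_value(xs, ys, n_permutations=10_000, seed=0):
    """Generic two-sample permutation test on the absolute difference in means.

    This is NOT the paper's statistic; it only illustrates how a fixed
    number of random permutations yields a p-value estimate.
    """
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    n_x = len(xs)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(xs, ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled samples
        if stat(pooled[:n_x], pooled[n_x:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

With identical groups the observed statistic is zero, so every permutation matches or exceeds it and the estimate is 1; with clearly separated groups almost no permutation does, and the estimate approaches its floor of 1 / (n_permutations + 1).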