Necessary and Sufficient Watermark for Large Language Models

Authors: Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, Makoto Yamada

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through the experiments, we demonstrate that the NS-Watermark can generate more natural texts than existing watermarking methods and distinguish more accurately between texts written by LLMs and those written by humans. Especially in machine translation tasks, the NS-Watermark can outperform the existing watermarking method by up to 30 BLEU points.
Researcher Affiliation | Collaboration | Yuki Takezawa (Kyoto University, OIST), Ryoma Sato (NII), Han Bao (Kyoto University, OIST), Kenta Niwa (NTT Communication Science Laboratories), Makoto Yamada (OIST)
Pseudocode | Yes | Algorithm 1: Naive algorithm for the NS-Watermark (input: maximum number of words Tmax, vocabulary V, beam size k, and hyperparameters γ, Z). Algorithm 2: Linear-time algorithm for the NS-Watermark (input: Tmax, V, k, the length T̂ of the generated text without watermarks, and hyperparameters γ, Z, α). Algorithm 3: Adaptive Soft-Watermark (input: Tmax, V, k, hyperparameters γ, Z, α, and a set). Algorithm 4: Adaptive Soft-Watermark with β (input: Tmax, V, k, hyperparameters γ, Z, α, β, and a set).
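The hyperparameter Z above is the detection threshold shared with the Soft-Watermark of Kirchenbauer et al. (2023a), on which the NS-Watermark builds. Not from the paper itself, but as background: a minimal sketch of that standard z-score detector, assuming the usual binomial null model in which each token of human text falls in the "green" list with probability γ.

```python
import math

def detection_z_score(num_green: int, total_tokens: int, gamma: float) -> float:
    """z-statistic for green-list watermark detection (Kirchenbauer et al., 2023a).

    Under the null hypothesis (human-written text), each token lands in the
    green list independently with probability gamma, so the green-token count
    is Binomial(T, gamma); this standardizes the observed count.
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std

# A text is flagged as watermarked when z exceeds the threshold Z (Z = 4 here,
# as in the paper's experimental setup).
Z = 4.0
z = detection_z_score(num_green=90, total_tokens=200, gamma=0.25)
print(z > Z)  # True: 90 green tokens out of 200 is far above the expected 50
```

With γ = 0.25 and 200 tokens, the expected green count for human text is 50, so 90 green tokens gives z ≈ 6.5, well past the Z = 4 threshold.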
Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a repository for the implementation.
Open Datasets | Yes | We used the NLLB-200-3.3B model (Team et al., 2022) with the test datasets of WMT'14 French (Fr)–English (En) and WMT'16 German (De)–English (En). We used the LLaMA-7B model (Touvron et al., 2023) with subsets of the C4 realnewslike dataset (Raffel et al., 2020).
Dataset Splits | Yes | To tune hyperparameters, we split the data into validation and test datasets with a 10/90 ratio and used the validation dataset for tuning.
Hardware Specification | Yes | All experiments were run on an A100 GPU.
Software Dependencies | No | The paper does not provide specific software versions for the libraries or frameworks used in the experiments.
Experiment Setup | Yes | Following prior work (Kirchenbauer et al., 2023a), we set the hyperparameter Z to 4. For the other hyperparameters, we split the data into validation and test datasets with a 10/90 ratio and tuned on the validation dataset. For the NS-Watermark, we selected the γ with the best BLEU score (Papineni et al., 2002) on the validation dataset using a grid search. …we selected the hyperparameters of these methods with the best BLEU score while achieving more than 95% FNR on the validation dataset using a grid search. See Sec. D for more detailed hyperparameter settings.
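The grid search over γ described above can be sketched as a simple loop; the `evaluate` callable stands in for the paper's actual pipeline (generating watermarked translations and scoring them with BLEU) and is a hypothetical placeholder, as is the toy scoring function below.

```python
def grid_search_gamma(gammas, evaluate):
    """Return the gamma with the best validation BLEU from a candidate grid.

    `evaluate` is a hypothetical callable mapping gamma -> validation BLEU;
    in the paper this would involve generating watermarked text and scoring
    it against references.
    """
    best_gamma, best_bleu = None, float("-inf")
    for gamma in gammas:
        bleu = evaluate(gamma)
        if bleu > best_bleu:
            best_gamma, best_bleu = gamma, bleu
    return best_gamma, best_bleu

# Toy stand-in for the validation BLEU curve: peaks at gamma = 0.25.
toy_bleu = lambda g: 30.0 - 100.0 * (g - 0.25) ** 2
gamma, bleu = grid_search_gamma([0.1, 0.25, 0.5, 0.75], toy_bleu)
print(gamma, bleu)  # 0.25 30.0
```

The baseline tuning quoted above adds a constraint (a minimum detection-rate requirement) on top of this objective, which would translate into skipping any γ that fails the constraint before comparing BLEU.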