Necessary and Sufficient Watermark for Large Language Models

Authors: Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, Makoto Yamada

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through the experiments, we demonstrate that the NS-Watermark can generate more natural texts than existing watermarking methods and distinguish more accurately between texts written by LLMs and those written by humans. Especially in machine translation tasks, the NS-Watermark can outperform the existing watermarking method by up to 30 BLEU points.
Researcher Affiliation | Collaboration | Yuki Takezawa (Kyoto University, OIST), Ryoma Sato (NII), Han Bao (Kyoto University, OIST), Kenta Niwa (NTT Communication Science Laboratories), Makoto Yamada (OIST)
Pseudocode | Yes | Algorithm 1: Naive algorithm for the NS-Watermark (input: maximum number of words Tmax, vocabulary V, beam size k, and hyperparameters γ, Z). Algorithm 2: Linear-time algorithm for the NS-Watermark (input: Tmax, V, k, the length T̂ of the generated text without watermarks, and hyperparameters γ, Z, α). Algorithm 3: Adaptive Soft-Watermark (input: Tmax, V, k, hyperparameters γ, Z, α, and a set). Algorithm 4: Adaptive Soft-Watermark with β (input: Tmax, V, k, hyperparameters γ, Z, α, β, and a set).
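The hyperparameter Z above is the detection threshold shared with the Soft-Watermark of Kirchenbauer et al. (2023a), on which the NS-Watermark builds. Not from the paper itself, but as background: a minimal sketch of that standard z-score detector, assuming the usual binomial null model in which each token of human text falls in the "green" list with probability γ.

```python
import math

def detection_z_score(num_green: int, total_tokens: int, gamma: float) -> float:
    """z-statistic for green-list watermark detection (Kirchenbauer et al., 2023a).

    Under the null hypothesis (human-written text), each token lands in the
    green list independently with probability gamma, so the green-token count
    is Binomial(T, gamma); this standardizes the observed count.
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std

# A text is flagged as watermarked when z exceeds the threshold Z (Z = 4 here,
# as in the paper's experimental setup).
Z = 4.0
z = detection_z_score(num_green=90, total_tokens=200, gamma=0.25)
print(z > Z)  # True: 90 green tokens out of 200 is far above the expected 50
```

With γ = 0.25 and 200 tokens, the expected green count for human text is 50, so 90 green tokens gives z ≈ 6.5, well past the Z = 4 threshold.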
Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a repository for the implementation.
Open Datasets | Yes | We used the NLLB-200-3.3B model (Team et al., 2022) with the test datasets of WMT'14 French (Fr)–English (En) and WMT'16 German (De)–English (En). We used the LLaMA-7B model (Touvron et al., 2023) with subsets of the C4 realnewslike dataset (Raffel et al., 2020).
Dataset Splits | Yes | To tune hyperparameters, we split the data into validation and test datasets with a 10/90 ratio and used the validation dataset for tuning.
Hardware Specification | Yes | All experiments were run on an A100 GPU.
Software Dependencies | No | The paper does not provide specific software versions for the libraries or frameworks used in the experiments.
Experiment Setup | Yes | Following prior work (Kirchenbauer et al., 2023a), we set the hyperparameter Z to 4. For the other hyperparameters, we split the data into validation and test datasets with a 10/90 ratio and tuned on the validation dataset. For the NS-Watermark, we selected the γ with the best BLEU score (Papineni et al., 2002) on the validation dataset using a grid search. …we selected the hyperparameters of these methods with the best BLEU score while achieving more than 95% FNR on the validation dataset using a grid search. See Sec. D for more detailed hyperparameter settings.
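The grid search over γ described above can be sketched as a simple loop; the `evaluate` callable stands in for the paper's actual pipeline (generating watermarked translations and scoring them with BLEU) and is a hypothetical placeholder, as is the toy scoring function below.

```python
def grid_search_gamma(gammas, evaluate):
    """Return the gamma with the best validation BLEU from a candidate grid.

    `evaluate` is a hypothetical callable mapping gamma -> validation BLEU;
    in the paper this would involve generating watermarked text and scoring
    it against references.
    """
    best_gamma, best_bleu = None, float("-inf")
    for gamma in gammas:
        bleu = evaluate(gamma)
        if bleu > best_bleu:
            best_gamma, best_bleu = gamma, bleu
    return best_gamma, best_bleu

# Toy stand-in for the validation BLEU curve: peaks at gamma = 0.25.
toy_bleu = lambda g: 30.0 - 100.0 * (g - 0.25) ** 2
gamma, bleu = grid_search_gamma([0.1, 0.25, 0.5, 0.75], toy_bleu)
print(gamma, bleu)  # 0.25 30.0
```

The baseline tuning quoted above adds a constraint (a minimum detection-rate requirement) on top of this objective, which would translate into skipping any γ that fails the constraint before comparing BLEU.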