Robust Multi-bit Text Watermark with LLM-based Paraphrasers

Authors: Xiaojun Xu, Jinghan Jia, Yuanshun Yao, Yang Liu, Hang Li

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation.
Researcher Affiliation Collaboration 1 ByteDance Research, 2 Michigan State University, 3 University of California, Santa Cruz. Correspondence to: Xiaojun Xu <EMAIL>.
Pseudocode Yes The encoding algorithm is shown in Alg. 1. We track the current watermark bit, and the next token is generated with the corresponding paraphraser θ_bit. After each generation step, we check whether the next token will be in a new segment by calculating S(x_w; mode=E). If a new segment starts, we update bit to the next bit in the watermark message.
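The quoted encoding loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `paraphrasers` stands in for the bit-conditioned paraphrasers θ_0/θ_1 (here, callables that emit one token given the text so far), and `segment_fn` stands in for S(x_w; mode=E), returning True when the last generated token closes a segment. All names and signatures are assumptions.

```python
def encode_watermark(message_bits, prompt_tokens, paraphrasers, segment_fn, max_tokens=128):
    """Sketch of the multi-bit encoding loop described in Alg. 1.

    message_bits:  list of 0/1 bits to embed.
    prompt_tokens: initial token list (may be empty).
    paraphrasers:  dict {0: fn, 1: fn}; each fn(tokens) -> next token,
                   playing the role of paraphraser θ_bit.
    segment_fn:    fn(tokens) -> bool; True when a segment boundary is
                   reached (stand-in for S(x_w; mode=E)).
    """
    x_w = list(prompt_tokens)
    bit_idx = 0  # track the current watermark bit
    while len(x_w) < max_tokens and bit_idx < len(message_bits):
        bit = message_bits[bit_idx]
        x_w.append(paraphrasers[bit](x_w))  # generate with θ_bit
        if segment_fn(x_w):                 # new segment -> advance to next bit
            bit_idx += 1
    return x_w
```

With toy paraphrasers that always emit "a" (bit 0) or "b" (bit 1) and a segment boundary every 3 tokens, embedding the message [0, 1] yields three "a" tokens followed by three "b" tokens.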
Open Source Code Yes We open-source the code: https://github.com/xiaojunxu/multi-bit-text-watermark.
Open Datasets Yes The encoder and decoder are trained and evaluated on the C4 Real News Like dataset (Raffel et al., 2020), processed using standard settings in (Kirchenbauer et al., 2023; Xu et al., 2024; Lau et al., 2024). Unless otherwise specified, we use texts of 128 tokens for training and evaluation.
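The fixed-length preprocessing the quote describes ("texts with 128 tokens") can be sketched as a simple chunking step. This is only illustrative: the paper's pipeline uses a subword tokenizer over C4 Real News Like, whereas this helper just slices an already-tokenized sequence into 128-token samples, dropping the remainder.

```python
def chunk_to_fixed_length(token_ids, length=128):
    """Split a tokenized document into non-overlapping fixed-length samples.

    Stand-in for the standard C4 preprocessing referenced in the paper;
    any trailing partial chunk shorter than `length` is discarded.
    """
    return [
        token_ids[i : i + length]
        for i in range(0, len(token_ids) - length + 1, length)
    ]
```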
Dataset Splits No The encoder and decoder are trained and evaluated on the C4 Real News Like dataset (Raffel et al., 2020), processed using standard settings in (Kirchenbauer et al., 2023; Xu et al., 2024; Lau et al., 2024). Unless otherwise specified, we use texts of 128 tokens for training and evaluation.
Hardware Specification No We use a relatively small TinyLlama-1.1B model architecture (Zhang et al., 2024a) for θ0, θ1 and θd, as we observe that small models can already achieve good performance in paraphrasing and watermarking. We show the experiments with larger Llama2-7b models in Appendix C.
Software Dependencies No We use a relatively small TinyLlama-1.1B model architecture (Zhang et al., 2024a) for θ0, θ1 and θd, as we observe that small models can already achieve good performance in paraphrasing and watermarking. We show the experiments with larger Llama2-7b models in Appendix C.
Experiment Setup Yes We fine-tune the model for 10,000 steps with a batch size of 4. We use λw = 0.1, λs = 1.0 and λk = 0.02 as the coefficients. In the initialization stage, we generate the paraphrased data x^SFT_para with the Pegasus paraphraser (Zhang et al., 2020), and use λJS = 1.0 for the initialization loss.
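The reported hyperparameters can be collected into a single config for reference. The coefficient values (10,000 steps, batch size 4, λw = 0.1, λs = 1.0, λk = 0.02, λJS = 1.0) are quoted from the paper; the loss-term names and the weighted-sum form below are assumptions inferred from the coefficient subscripts, not the paper's stated objective.

```python
# Values quoted from the experiment setup; key names are illustrative.
TRAIN_CONFIG = {
    "steps": 10_000,
    "batch_size": 4,
    "lambda_w": 0.1,    # assumed: watermark (decoding) loss weight
    "lambda_s": 1.0,    # assumed: semantic-preservation loss weight
    "lambda_k": 0.02,   # assumed: KL regularization weight
    "lambda_js": 1.0,   # JS loss weight for the initialization stage
}


def total_loss(loss_w, loss_s, loss_k, cfg=TRAIN_CONFIG):
    """Hypothetical weighted-sum objective combining the three coefficients."""
    return (
        cfg["lambda_w"] * loss_w
        + cfg["lambda_s"] * loss_s
        + cfg["lambda_k"] * loss_k
    )
```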