An End-to-End Model for Logits-Based Large Language Models Watermarking

Authors: Ka Him Wong, Jicheng Zhou, Jiantao Zhou, Yain-Whar Si

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our method achieves superior robustness, outperforming distortion-free methods by 37.39% under paraphrasing and by 17.2% on average, while maintaining text quality on par with the distortion-free methods in terms of text perplexity and downstream tasks.
Researcher Affiliation | Academia | (1) State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, China; (2) Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, China.
Pseudocode | No | The paper describes methods in regular paragraph text and uses diagrams (e.g., Figure 1, Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/KAHIMWONG/E2E_LLM_WM.
Open Datasets | Yes | We use samples from the WikiText-103 dataset (Merity et al., 2017) as prompts for training and C4 (Raffel et al., 2020) for evaluation.
Dataset Splits | Yes | To train our end-to-end model, we choose OPT-1.3B as the online LLM to reduce training cost. We use samples from the WikiText-103 dataset (Merity et al., 2017) as prompts for training and C4 (Raffel et al., 2020) for evaluation. We use the first 30 tokens from the C4 dataset as prompts and generate 200 clean watermarked tokens as a watermark sample, with original human-written text serving as non-watermarked samples.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX A6000 48GB GPU.
Software Dependencies | No | The paper mentions several tools and algorithms, such as the Adam optimizer, OPT-1.3B, MGDA, and Gumbel-Softmax sampling (GSS), but does not provide specific version numbers for software libraries such as Python, PyTorch, or CUDA that would be needed to replicate the experiment.
Experiment Setup | Yes | Table 15 provides detailed hyperparameters for the end-to-end model training, including learning rate (1e-4), batch size (8), training steps (35k), encoder context size (10), top-k candidates (20), Gumbel-softmax temperature (0.1), watermark strength (1), and weights for Ldec (10) and Lsem (1).
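The dataset-splits row above describes a concrete evaluation protocol: take the first 30 tokens of each C4 text as a prompt, generate 200 watermarked tokens, and keep the human continuation as the non-watermarked counterpart. The sketch below illustrates one way that pairing could be implemented; `tokenize` and `generate_watermarked` are placeholders we introduce here, not functions from the authors' released code.

```python
def build_detection_samples(texts, tokenize, generate_watermarked,
                            prompt_len=30, gen_len=200):
    """Pair watermarked generations with human-written continuations.

    `tokenize` maps a string to a token list; `generate_watermarked`
    maps (prompt tokens, length) to generated tokens. Both are
    user-supplied placeholders in this sketch.
    """
    samples = []
    for text in texts:
        tokens = tokenize(text)
        if len(tokens) <= prompt_len:
            continue  # too short to yield both a prompt and a continuation
        prompt = tokens[:prompt_len]
        samples.append({
            "prompt": prompt,
            "watermarked": generate_watermarked(prompt, gen_len),
            "non_watermarked": tokens[prompt_len:],  # human continuation
        })
    return samples
```

With a whitespace tokenizer and a stub generator this produces one dict per sufficiently long input text, which is all a detector needs for a paired watermarked/non-watermarked evaluation.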
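Since the software-dependencies row flags Gumbel-Softmax sampling (GSS) without pinning any library versions, a library-agnostic sketch of the trick in plain Python may help a re-implementer; the function name and structure are ours, and a real implementation would operate on tensors (e.g., via a deep-learning framework) rather than Python lists.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=0.1, rng=None):
    """Soft sample from `logits` via the Gumbel-Softmax trick:
    perturb each logit with Gumbel(0, 1) noise, then apply a
    temperature-scaled softmax. Lower temperatures give sharper,
    more one-hot-like distributions (the paper's table reports 0.1).
    """
    rng = rng or random.Random(0)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    perturbed = [
        (logit - math.log(-math.log(rng.random() or 1e-12))) / temperature
        for logit in logits
    ]
    # Numerically stable softmax over the perturbed logits
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]
```

In the end-to-end watermarking setting, such a relaxation would typically be applied over the top-k candidate tokens so that token selection stays differentiable during training.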
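The hyperparameters quoted from Table 15 can be collected into a single configuration for a replication attempt. The dictionary and loss helper below are an illustrative sketch: the key names and the weighted-sum form of the combined loss are our assumptions, not code from the paper.

```python
# Hyperparameters as reported in the paper's Table 15;
# key names are invented here for readability.
E2E_TRAINING_CONFIG = {
    "learning_rate": 1e-4,              # Adam optimizer
    "batch_size": 8,
    "training_steps": 35_000,
    "encoder_context_size": 10,
    "top_k_candidates": 20,
    "gumbel_softmax_temperature": 0.1,
    "watermark_strength": 1.0,
    "loss_weights": {"dec": 10.0, "sem": 1.0},  # weights for L_dec, L_sem
}


def total_loss(l_dec, l_sem, cfg=E2E_TRAINING_CONFIG):
    """Weighted combination of the decoding and semantic losses,
    assuming a simple weighted sum with the reported weights."""
    w = cfg["loss_weights"]
    return w["dec"] * l_dec + w["sem"] * l_sem
```

Keeping the weights in the config makes the 10:1 emphasis on the decoding loss over the semantic loss explicit and easy to vary in ablations.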