Optimizing Adaptive Attacks against Watermarks for Language Models
Authors: Abdulrahman Diaa, Toluwani Aremu, Nils Lukas
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. |
| Researcher Affiliation | Academia | 1David R. Cheriton School of Computer Science, University of Waterloo, Ontario, Canada 2Mohammed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE. |
| Pseudocode | Yes | Algorithm 1 curates a preference dataset to optimize the adaptive attack's objective in Equation (2). |
| Open Source Code | Yes | We release our adaptively tuned paraphrasers at https://github.com/nilslukas/ada-wm-evasion. |
| Open Datasets | Yes | The evaluation set consists of 296 prompts from Piet et al. (2023), covering book reports, storytelling, and fake news. |
| Dataset Splits | Yes | The evaluation set consists of 296 prompts from Piet et al. (2023), covering book reports, storytelling, and fake news. The training set comprises a synthetic dataset of 1,000 prompts covering diverse topics, including reviews, historical summaries, biographies, environmental issues, science, mathematics, news, recipes, travel, social media, arts, social sciences, music, engineering, coding, sports, politics, and health. |
| Hardware Specification | Yes | We report all runtimes on NVIDIA A100 GPUs accelerated using vLLM (Kwon et al., 2023) for inference and DeepSpeed (Microsoft, 2021) for training. |
| Software Dependencies | No | Our implementation uses PyTorch and the Transformer Reinforcement Learning (TRL) library (von Werra et al., 2020). We use the open-source repository by Piet et al. (2023), which implements the four surveyed watermarking methods. (No specific version numbers are provided for PyTorch, TRL, vLLM, or DeepSpeed; only the tools themselves are mentioned, with citations.) |
| Experiment Setup | Yes | We train our paraphraser models using the following hyperparameters: a batch size of 32, a learning rate of 5×10⁻⁴, and a maximum sequence length of 512 tokens. We use the AdamW optimizer with a linear learning rate scheduler that warms up the learning rate for the first 20% of the training steps and then linearly decays it to zero. We train the models for 1 epoch only to prevent overfitting. We utilize Low-Rank Adaptation (LoRA) (Hu et al., 2022) to reduce the number of trainable parameters in the model. We set the rank to 32 and the alpha parameter to 16. |
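The warmup-then-decay schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code; the `lr_at_step` helper and the step counts are hypothetical, with only the peak learning rate (5×10⁻⁴) and the 20% warmup fraction taken from the paper.

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-4, warmup_frac: float = 0.2) -> float:
    """Linear warmup over the first `warmup_frac` of steps, then linear
    decay to zero at `total_steps`, as described in the experiment setup."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp from ~0 up to peak_lr across the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    # Decay linearly from peak_lr (end of warmup) to zero (final step).
    decay_steps = total_steps - warmup_steps
    return peak_lr * (total_steps - step) / decay_steps


# Example: a hypothetical 100-step run.
schedule = [lr_at_step(s, 100) for s in range(100)]
```

The same shape is what PyTorch's `LambdaLR` (or TRL's built-in linear scheduler with warmup) would produce when given these fractions.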