Optimizing Adaptive Attacks against Watermarks for Language Models
Authors: Abdulrahman Diaa, Toluwani Aremu, Nils Lukas
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. |
| Researcher Affiliation | Academia | 1David R. Cheriton School of Computer Science, University of Waterloo, Ontario, Canada 2Mohammed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE. |
| Pseudocode | Yes | Algorithm 1 curates a preference dataset to optimize the adaptive attack's objective in Equation (2). |
| Open Source Code | Yes | We release our adaptively tuned paraphrasers at https://github.com/nilslukas/ada-wm-evasion. |
| Open Datasets | Yes | The evaluation set consists of 296 prompts from Piet et al. (2023), covering book reports, storytelling, and fake news. |
| Dataset Splits | Yes | The evaluation set consists of 296 prompts from Piet et al. (2023), covering book reports, storytelling, and fake news. The training set comprises a synthetic dataset of 1,000 prompts covering diverse topics, including reviews, historical summaries, biographies, environmental issues, science, mathematics, news, recipes, travel, social media, arts, social sciences, music, engineering, coding, sports, politics, and health. |
| Hardware Specification | Yes | We report all runtimes on NVIDIA A100 GPUs accelerated using vLLM (Kwon et al., 2023) for inference and DeepSpeed (Microsoft, 2021) for training. |
| Software Dependencies | No | Our implementation uses PyTorch and the Transformer Reinforcement Learning (TRL) library (von Werra et al., 2020). We use the open-source repository by Piet et al. (2023), which implements the four surveyed watermarking methods. (No specific version numbers are provided for PyTorch, TRL, vLLM, or DeepSpeed; only the tools themselves are mentioned, with citations.) |
| Experiment Setup | Yes | We train our paraphraser models using the following hyperparameters: a batch size of 32, a learning rate of 5×10⁻⁴, and a maximum sequence length of 512 tokens. We use the AdamW optimizer with a linear learning rate scheduler that warms up the learning rate for the first 20% of the training steps and then linearly decays it to zero. We train the models for 1 epoch only to prevent overfitting. We utilize Low-Rank Adaptation (LoRA) (Hu et al., 2022) to reduce the number of trainable parameters in the model. We set the rank to 32 and the alpha parameter to 16. |
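The warmup-then-decay schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code; the `lr_at_step` helper and the step counts are hypothetical, with only the peak learning rate (5×10⁻⁴) and the 20% warmup fraction taken from the paper.

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-4, warmup_frac: float = 0.2) -> float:
    """Linear warmup over the first `warmup_frac` of steps, then linear
    decay to zero at `total_steps`, as described in the experiment setup."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp from ~0 up to peak_lr across the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    # Decay linearly from peak_lr (end of warmup) to zero (final step).
    decay_steps = total_steps - warmup_steps
    return peak_lr * (total_steps - step) / decay_steps


# Example: a hypothetical 100-step run.
schedule = [lr_at_step(s, 100) for s in range(100)]
```

The same shape is what PyTorch's `LambdaLR` (or TRL's built-in linear scheduler with warmup) would produce when given these fractions.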