A Transfer Attack to Image Watermarks
Authors: Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our transfer attack on image datasets from Stable Diffusion and Midjourney, using multiple watermarking methods (Zhu et al., 2018; Tancik et al., 2020; Fernandez et al., 2023; Jiang et al., 2024). Our attack, using dozens of surrogate models, successfully evades watermark detectors while maintaining image quality (see examples in Figure 1). This holds even when surrogate models differ from the target in algorithms, architectures, watermark lengths, and training datasets. Our attack also outperforms common post-processing, existing transfer attacks (Jiang et al., 2023; An et al., 2024), and the state-of-the-art purification method (Nie et al., 2022), showing that existing image watermarks are broken even in the no-box setting. We note that the effectiveness of our attack against a completely new target watermarking method is unclear, which we discuss in Section 7. |
| Researcher Affiliation | Academia | Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Zhenqiang Gong (Duke University) |
| Pseudocode | Yes | Algorithm 1 (Appendix) outlines the procedure for finding the perturbation δ. |
| Open Source Code | Yes | Our code is available at: https://github.com/hifihyp/Watermark-Transfer-Attack. |
| Open Datasets | Yes | In our experiments, we utilize three publicly available datasets (Wang et al., 2023; Turc & Nemade, 2022; Images, 2023) generated by Stable Diffusion, Midjourney, and DALL-E 2. |
| Dataset Splits | Yes | Each training set contains 10,000 images, and each testing set contains 1,000 images. The details of the datasets are introduced in Appendix H. For testing, we randomly sample 1,000 images from the testing set of each dataset, embed the ground-truth watermark into each of them using a target encoder, and then find the perturbation to each watermarked image using different methods. To train the surrogate watermarking models, we sample 10,000 images from another public dataset (Images, 2023) generated by DALL-E 2, i.e., the surrogate dataset consists of these 10,000 images. |
| Hardware Specification | Yes | Table 2: Computational cost comparison of existing attacks and our transfer attack on a single NVIDIA RTX-6000 GPU. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | By default, we set the maximum number of iterations max_iter = 5,000, perturbation budget r = 0.25, sensitivity ϵ = 0.2, and learning rate α = 0.1 for our transfer attack. Unless otherwise mentioned, we use Inverse-Decode to select a target watermark for a surrogate decoder, and Ensemble Optimization to find the perturbation. α is increased as the number of surrogate watermarking models grows, in order to satisfy the constraints of our optimization problem within 5,000 iterations. The detailed settings of α for different numbers of surrogate models are shown in Table 1 in the Appendix. Moreover, we use ℓ2-distance as the distance metric l(·,·) for two watermarks. For the detection threshold τ, we set it based on the watermark length of the target watermarking model: τ is set to a value such that the false positive rate of the watermark-based detector is no larger than 10^-4 when the double-tail detector is employed, i.e., τ = 0.9, 0.83, and 0.73 for target watermarking models with watermark lengths of 20, 30, and 64 bits, respectively. |
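The threshold-setting step in the Experiment Setup row can be made concrete. For a non-watermarked image, each decoded bit matches the ground-truth watermark with probability 1/2, so the number of matching bits is Binomial(n, 1/2) and the double-tail detector's false positive rate is a two-sided binomial tail. The sketch below computes that FPR; the exact detection rule (flagging when bitwise accuracy BA ≥ τ or BA ≤ 1−τ) and the `smallest_tau` helper are illustrative assumptions, not the paper's code, so the resulting thresholds need not match the paper's reported values.

```python
from math import comb

def double_tail_fpr(n_bits: int, tau: float) -> float:
    """FPR of a double-tail detector that flags an image when the bitwise
    accuracy BA satisfies BA >= tau or BA <= 1 - tau. For a non-watermarked
    image, the number of matching bits is Binomial(n_bits, 1/2), so the FPR
    is twice the upper binomial tail."""
    # Smallest bit-match count whose accuracy reaches the threshold.
    k_min = min(k for k in range(n_bits + 1) if k / n_bits >= tau)
    upper_tail = sum(comb(n_bits, k) for k in range(k_min, n_bits + 1))
    return 2 * upper_tail / 2 ** n_bits

def smallest_tau(n_bits: int, target_fpr: float = 1e-4) -> float:
    """Smallest threshold (fraction of matching bits) with FPR <= target_fpr."""
    for k in range(n_bits // 2 + 1, n_bits + 1):
        if double_tail_fpr(n_bits, k / n_bits) <= target_fpr:
            return k / n_bits
    return 1.0
```

For a 20-bit watermark, `double_tail_fpr(20, 0.9)` counts the two tails at 18 or more matching bits; sweeping k upward with `smallest_tau` then returns the first accuracy level whose exact binomial FPR drops below 10^-4.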
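The remaining defaults (budget r, learning rate α, max_iter, ℓ2 distance, Ensemble Optimization) describe a projected-gradient search for the perturbation δ: minimize the summed ℓ2 distance between each surrogate decoder's output on the perturbed image and the target watermark, while keeping δ within the budget. A toy NumPy sketch under stated assumptions — linear-sigmoid surrogate "decoders" stand in for the paper's trained neural decoders, the ℓ∞ clip stands in for the paper's budget constraint, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a flattened image and K linear surrogate "decoders".
d, n_bits, K = 256, 30, 5
image = rng.uniform(0.0, 1.0, d)
decoders = [rng.normal(0.0, 0.1, (n_bits, d)) for _ in range(K)]
target_wm = rng.integers(0, 2, n_bits).astype(float)  # target watermark bits

def decode(W, x):
    """Soft decoded bits in [0, 1] from a linear-sigmoid surrogate."""
    return 1.0 / (1.0 + np.exp(-W @ x))

def loss_and_grad(delta):
    """Sum over surrogates of squared l2 distance to the target watermark."""
    loss, grad = 0.0, np.zeros_like(delta)
    for W in decoders:
        b = decode(W, image + delta)
        resid = b - target_wm
        loss += float(np.sum(resid ** 2))
        grad += W.T @ (2.0 * resid * b * (1.0 - b))  # chain rule via sigmoid
    return loss, grad

r, alpha, max_iter = 0.25, 0.1, 5000  # budget, learning rate, iterations
delta = np.zeros(d)
initial_loss, _ = loss_and_grad(delta)
for _ in range(max_iter):
    _, grad = loss_and_grad(delta)
    delta = np.clip(delta - alpha * grad, -r, r)  # step, then project to budget
final_loss, _ = loss_and_grad(delta)
```

The clip after every gradient step is the projection onto the budget constraint; increasing α as more surrogates are added (as the setup row notes) compensates for the larger aggregate loss so the constraint is met within the iteration cap.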