A Transfer Attack to Image Watermarks

Authors: Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate our transfer attack on image datasets from Stable Diffusion and Midjourney, using multiple watermarking methods (Zhu et al., 2018; Tancik et al., 2020; Fernandez et al., 2023; Jiang et al., 2024). Our attack, using dozens of surrogate models, successfully evades watermark detectors while maintaining image quality (see examples in Figure 1). This holds even when the surrogate models differ from the target in algorithm, architecture, watermark length, and training dataset. Our attack also outperforms common post-processing, existing transfer attacks (Jiang et al., 2023; An et al., 2024), and the state-of-the-art purification method (Nie et al., 2022), showing that existing image watermarks are broken even in the no-box setting. We note that the effectiveness of our attack against a completely new target watermarking method is unclear, which we discuss in Section 7.
Researcher Affiliation | Academia | Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Zhenqiang Gong, Duke University, EMAIL
Pseudocode | Yes | Algorithm 1 (Appendix) outlines the procedure for finding the perturbation δ.
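The paper's Algorithm 1 is not reproduced in this report. As a rough illustration of what a perturbation search over an ensemble of surrogate decoders can look like, the sketch below runs projected gradient descent against toy linear-sigmoid "decoders". The decoder model, loss, and update rule are assumptions for illustration, not the authors' exact procedure; the parameter names (r, alpha, eps, max_iter) mirror the default setup quoted in the Experiment Setup row.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ensemble_attack(x, Ws, targets, r=0.25, alpha=0.1, max_iter=500, eps=0.2):
    """Sketch of an ensemble perturbation search (illustrative, not the paper's algorithm).

    x       : flattened watermarked image, values in [0, 1]
    Ws      : list of weight matrices; decoder_i(x) = sigmoid(W_i @ x) is a toy
              stand-in for the i-th surrogate watermark decoder
    targets : list of target (soft) bit-strings, one per surrogate decoder
    r       : perturbation budget (here an l-infinity box, an assumption)
    alpha   : learning rate
    eps     : stop once every decoded output is within eps of its target (l2)
    """
    delta = np.zeros_like(x)
    for _ in range(max_iter):
        grad = np.zeros_like(x)
        dists = []
        for W, t in zip(Ws, targets):
            d = sigmoid(W @ (x + delta))       # decoded (soft) watermark bits
            diff = d - t
            dists.append(np.linalg.norm(diff))
            # analytic gradient of 0.5 * ||d - t||^2 with respect to delta
            grad += W.T @ (diff * d * (1.0 - d))
        if max(dists) <= eps:                  # all surrogate decoders pushed to target
            break
        delta -= alpha * grad / len(Ws)        # gradient step on the ensemble loss
        delta = np.clip(delta, -r, r)          # project back into the budget
    return delta
```

The ensemble loop averages gradients over all surrogate decoders, which is one common way to realize the "Ensemble Optimization" named in the setup; the real attack optimizes through trained neural decoders rather than these linear stand-ins.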
Open Source Code | Yes | Our code is available at: https://github.com/hifihyp/Watermark-Transfer-Attack.
Open Datasets | Yes | In our experiments, we utilize three publicly available datasets (Wang et al., 2023; Turc & Nemade, 2022; Images, 2023) generated by Stable Diffusion, Midjourney, and DALL-E 2.
Dataset Splits | Yes | Each training set contains 10,000 images, and each testing set contains 1,000 images; the details of the datasets are introduced in Appendix H. For testing, we randomly sample 1,000 images from the testing set of each dataset, embed the ground-truth watermark into each of them using a target encoder, and then find the perturbation to each watermarked image using different methods. To train the surrogate watermarking models, we sample 10,000 images from another public dataset (Images, 2023) generated by DALL-E 2; these 10,000 images form the surrogate dataset.
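As a purely hypothetical outline of the split-and-attack procedure this row describes (the function names, signatures, and `embed`/`attack` stand-ins are invented for illustration, not taken from the paper's code):

```python
import random

def build_eval_sets(test_pool, surrogate_pool, embed, attack,
                    n_test=1000, n_surrogate=10000, seed=0):
    """Hypothetical outline of the evaluation pipeline described above.

    embed  : stand-in for the target encoder (embeds the ground-truth watermark)
    attack : stand-in for the perturbation search applied to each watermarked image
    """
    rng = random.Random(seed)
    test_images = rng.sample(test_pool, n_test)                # 1,000 test images per dataset
    watermarked = [embed(img) for img in test_images]          # embed ground-truth watermark
    perturbed = [attack(img) for img in watermarked]           # attack each watermarked image
    surrogate_train = rng.sample(surrogate_pool, n_surrogate)  # 10,000 DALL-E 2 images
    return perturbed, surrogate_train
```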
Hardware Specification | Yes | Table 2: Computational cost comparison of existing attacks and our transfer attack on a single NVIDIA RTX-6000 GPU.
Software Dependencies | No | The paper does not state specific software dependencies or version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | By default, we set the maximum number of iterations max_iter = 5,000, perturbation budget r = 0.25, sensitivity ϵ = 0.2, and learning rate α = 0.1 for our transfer attack. Unless otherwise mentioned, we use Inverse-Decode to select a target watermark for a surrogate decoder, and Ensemble Optimization to find the perturbation. α is increased when the number of surrogate watermarking models increases, in order to satisfy the constraints of our optimization problem within 5,000 iterations; the detailed settings of α for different numbers of surrogate models are shown in Table 1 in the Appendix. Moreover, we use ℓ2-distance as the distance metric l(·, ·) for two watermarks. For the detection threshold τ, we set it based on the watermark length of the target watermarking model. Specifically, we set τ such that the false positive rate of the watermark-based detector is no larger than 10⁻⁴ when the double-tail detector is employed: τ is set to 0.9, 0.83, and 0.73 for target watermarking models with watermark lengths of 20 bits, 30 bits, and 64 bits, respectively.
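Thresholds of this kind can be sanity-checked with a binomial tail computation. The sketch below assumes the standard null model that an unwatermarked image decodes to n i.i.d. uniform bits, and that the double-tail detector flags bitwise accuracy ≥ τ or ≤ 1 − τ. Both modeling conventions are assumptions on my part, and the resulting τ need not match the paper's reported values exactly (rounding and strict-vs-non-strict inequalities shift the number).

```python
from math import comb

def double_tail_threshold(n, max_fpr=1e-4):
    """Smallest tau = m/n such that the double-tail detector's false positive
    rate is at most max_fpr, assuming an unwatermarked image decodes to n
    i.i.d. uniform bits (a standard null model, not taken from the paper).

    Double-tail detector: flag if bitwise accuracy >= tau or <= 1 - tau, so
    FPR(tau) = 2 * P[Binomial(n, 1/2) >= m] with m = tau * n.
    """
    total = 2 ** n
    for m in range(n // 2 + 1, n + 1):
        # exact upper tail P[X >= m] for X ~ Binomial(n, 1/2)
        tail = sum(comb(n, k) for k in range(m, n + 1)) / total
        if 2 * tail <= max_fpr:
            return m / n
    return 1.0
```

For example, with n = 20 this exact computation gives τ = 0.95 (19 of 20 matching bits), slightly above the 0.9 quoted above, which suggests the paper applies the threshold as a strict inequality or rounds differently.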