VideoShield: Regulating Diffusion-based Video Generation Models via Watermarking
Authors: Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various video models (both T2V and I2V models) demonstrate that our method effectively extracts watermarks and detects tampering without compromising video quality. Furthermore, we show that this approach is applicable to image generation models, enabling tamper detection in generated images as well. |
| Researcher Affiliation | Academia | Runyi Hu1, Jie Zhang2 , Yiming Li1, Jiwei Li3, Qing Guo2, Han Qiu4, Tianwei Zhang1 1Nanyang Technological University 2CFAR and IHPC, A*STAR, Singapore 3Zhejiang University 4Tsinghua University |
| Pseudocode | No | The paper describes the methodology in prose and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are available at https://github.com/hurunyi/VideoShield. |
| Open Datasets | Yes | For MS, we choose 50 prompts from the VBench (Huang et al., 2024) test set, covering five categories: Animal, Human, Plant, Scenery, and Vehicles, with 10 prompts per category. ... For SVD, we first employ a text-to-image (T2I) model, specifically Stable Diffusion 2.1 (AI, 2022), to generate 200 images...Additionally, we gather 5 real images from each of the 5 categories...We randomly sample 500 prompts from the Stable Diffusion-Prompts dataset2 to generate 500 images at a resolution of 512. |
| Dataset Splits | Yes | For MS, we choose 50 prompts from the VBench (Huang et al., 2024) test set, covering five categories: Animal, Human, Plant, Scenery, and Vehicles, with 10 prompts per category. For each prompt, we generate 4 videos, resulting in a total of 50 × 4 = 200 videos for evaluation. For SVD, we first employ a text-to-image (T2I) model, specifically Stable Diffusion 2.1 (AI, 2022), to generate 200 images corresponding to the 200 prompts used in the MS evaluation. These images are then used to create 200 videos for evaluation. Additionally, we gather 5 real images from each of the 5 categories, generating a total of 5 × 5 × 4 = 100 videos for evaluation. Except for evaluating spatial tamper localization with STTN and ProPainter, where we use 1/5 of the generated videos for manual annotation (as detailed in Appendix D.3), we use all the generated videos in other cases by default. To statistically analyze Awm and Aorig to obtain twm and torig, we further generate 100 watermarked videos and 100 original videos that are not included in the aforementioned dataset for both MS and SVD. |
| Hardware Specification | Yes | We provide the computation overhead of VIDEOSHIELD in Table 16. The primary GPU memory usage and runtime overhead are concentrated in the DDIM inversion stage. However, as shown in the table for step = 10 and step = 25, reducing the number of inversion steps can significantly decrease the runtime, with only a slight sacrifice in performance as shown in Table 12. Per-model overhead (GPU memory; inversion runtime at step=10 / step=25; extraction runtime for watermark / spatial / temporal): MS (1.83B params, 256 res): 3.77 GB, 1.2408 / 3.0617 s, 0.0011 / 0.0019 / 0.0004 s. SVD (2.25B, 512): 5.32 GB, 4.3214 / 10.2027 s, 0.0023 / 0.0019 / 0.0004 s. ZS (1.83B, 256): 3.77 GB, 1.2430 / 3.0492 s, 0.0011 / 0.0019 / 0.0004 s. I2VGen (2.48B, 512): 5.99 GB, 4.6700 / 11.1526 s, 0.0023 / 0.0019 / 0.0004 s. Evaluated on a single NVIDIA RTX A6000 GPU (49 GB) in FP16 mode. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers. |
| Experiment Setup | Yes | Implementation details. We select two popular open-source models as the default test models: the text-to-video (T2V) model ModelScope (MS) (Wang et al., 2023) and the image-to-video (I2V) model Stable-Video-Diffusion (SVD) (Blattmann et al., 2023). Videos are generated with 16 frames in FP16 mode for both models. The resolutions of the videos generated by the MS and SVD models are 256 and 512, respectively. We use the default sampler and text (image) guidance, with 25 inference steps and 25 inversion steps for both models. A total of 512 watermark bits are embedded into the generated videos. To achieve this, we set kf, kc, kh, kw to 8, 1, 4, 4 for MS and 8, 1, 8, 8 for SVD. For MS, k in PTB is set to 99, while it is set to 98 for SVD. For both models, ttemp and L are set to 0.55 and 3, respectively. |
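The k-values in the Experiment Setup row are consistent with the stated 512-bit capacity once the latent tensor shape is accounted for. As a minimal sketch, assuming the standard 8× VAE downsampling (so a 256-resolution video has 32×32 latents and a 512-resolution video has 64×64), 16 frames, 4 latent channels, and our inferred interpretation of kf, kc, kh, kw as per-dimension repetition block sizes (not stated verbatim in the quote):

```python
def watermark_capacity(frames, channels, height, width, kf, kc, kh, kw):
    """Number of embeddable watermark bits if each bit is repeated over a
    (kf x kc x kh x kw) block of the latent tensor.
    The block-repetition reading of the k-values is an assumption."""
    assert frames % kf == 0 and channels % kc == 0
    assert height % kh == 0 and width % kw == 0
    return (frames // kf) * (channels // kc) * (height // kh) * (width // kw)

# MS: 16 frames, 4 latent channels, 256/8 = 32x32 latents, k = (8, 1, 4, 4)
print(watermark_capacity(16, 4, 32, 32, 8, 1, 4, 4))   # -> 512
# SVD: 16 frames, 4 latent channels, 512/8 = 64x64 latents, k = (8, 1, 8, 8)
print(watermark_capacity(16, 4, 64, 64, 8, 1, 8, 8))   # -> 512
```

Both configurations yield 2 × 4 × 8 × 8 = 512 bits, matching the reported payload.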
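The Dataset Splits row notes that 100 extra watermarked and 100 original videos are generated solely to statistically derive the detection thresholds twm and torig from the bit-accuracy distributions Awm and Aorig. The paper excerpt does not specify the statistic used; the sketch below uses a simple Gaussian n-sigma rule as one plausible instantiation (the function name, `n_sigma`, and the rule itself are our assumptions, not the authors' procedure):

```python
from statistics import mean, stdev

def accuracy_threshold(accuracies, n_sigma=3.0, lower=True):
    """Derive a detection threshold from a held-out sample of bit accuracies.
    lower=True  -> threshold below which watermarked videos rarely fall (twm-style)
    lower=False -> threshold above which original videos rarely rise (torig-style)
    The Gaussian n-sigma rule is an assumed stand-in for the paper's
    unspecified statistical analysis."""
    m, s = mean(accuracies), stdev(accuracies)
    return m - n_sigma * s if lower else m + n_sigma * s

# e.g. bit accuracies measured on held-out watermarked videos (illustrative values)
wm_acc = [0.99, 0.98, 1.00, 0.97, 0.99]
t_wm = accuracy_threshold(wm_acc, n_sigma=3.0, lower=True)
```

A video whose extracted-bit accuracy exceeds `t_wm` would then be flagged as watermarked; the symmetric call with `lower=False` on original-video accuracies gives a torig-style bound.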