WMarkGPT: Watermarked Image Understanding via Multimodal Large Language Models
Authors: Songbai Tan, Xuerui Qiu, Yao Shu, Gang Xu, Linrui Xu, Xiangyu Xu, Huiping Zhuang, Ming Li, Fei Yu
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on synthetic and real watermarking QA datasets demonstrate that WMarkGPT outperforms existing MLLMs, achieving significant improvements in visibility prediction and content description. The experimental results, detailed in Tab. 1, show that WMarkGPT significantly outperforms the other models in terms of watermark description relevance, achieving higher BLEU-1, ROUGE-L, and LLM-Score values. Moreover, WMarkGPT demonstrates superior visibility prediction accuracy, achieving a 217.4% higher ACC than the second-best model, VILA-8B. To further investigate watermark description quality across different visibility levels, we evaluate these models on the WQA-Real dataset across five visibility categories. As shown in Tab. 2, WMarkGPT consistently achieves the highest BLEU-1 scores for watermarked image descriptions across all visibility levels, highlighting its robustness. 4.2. Ablation Study: To investigate the impact of the WQA-Synthetic dataset size on model performance ... Table 4: The impact of using varying sizes of WQA-Synthetic data for fine-tuning. As the training proportion increases, the model's ability to understand watermarks improves, resulting in more accurate watermark descriptions. Table 5: The performance of the model on WQA-Real after three different training stages. S1, S2, and S3 represent Stage-1, Stage-2, and Stage-3, respectively. |
| Researcher Affiliation | Academia | 1School of Management, Shenzhen University 2Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) 3Institute of Automation, Chinese Academy of Sciences 4School of Geosciences and Info-Physics, Central South University 5Xi'an Jiaotong University, China 6Shien-Ming Wu School of Intelligent Engineering, South China University of Technology. Correspondence to: Ming Li <EMAIL>. |
| Pseudocode | No | The paper describes a three-stage learning pipeline and detailed steps for dataset generation using figures and prose (e.g., 'The semi-automatic annotation pipeline of WQA-Synthetic is shown in Fig. 5. We randomly select 50K images from the COCO dataset (Lin et al., 2014) and 50K watermark logos from the LOGO-2K dataset (Wang et al., 2020), and then synthesize watermarked images while generating corresponding watermark descriptions using a four-step process: (1) Step-1: watermark segmentation; (2) Step-2: main object bounding box detection; (3) Step-3: watermarked image synthesis; and (4) Step-4: question-answering generation. Each stage is outlined in detail below.'), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The datasets and code are released at https://github.com/TanSongBai/WMarkGPT. |
| Open Datasets | Yes | To support this learning, we construct three custom visual question-answering (VQA) datasets. The first is an object location-aware dataset, built upon the COCO dataset, which includes image captions and object bounding boxes. ... These datasets will be publicly released to advance future research. The datasets and code are released at https://github.com/TanSongBai/WMarkGPT. |
| Dataset Splits | Yes | In our experimental setup, we randomly select 5K and 0.5K watermarked images from the WQA-Synthetic and WQA-Real datasets, respectively, to construct a diverse test set, and use the remaining images as the training set. |
| Hardware Specification | Yes | all experiments were conducted on 8 NVIDIA RTX 6000 Ada GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and cosine annealing learning rate schedule, but does not specify software names with version numbers (e.g., PyTorch 1.9, Python 3.8, CUDA 11.1). |
| Experiment Setup | Yes | In Stage-1, we generate 100K position question-answer pairs to train the vision encoder and visual abstractor, with a batch size of 32, a learning rate of 1×10⁻⁴, and for a duration of 3 epochs. ... In Stage-2 and Stage-3, following the configuration in mPLUG-Owl2, we apply fine-tuning with a batch size of 16, a learning rate of 2×10⁻⁵, and a duration of 5 epochs. Across all three stages, the Adam optimizer and cosine annealing learning rate schedule are used to dynamically adjust the learning rate. |
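The dataset-split protocol quoted above (randomly holding out 5K WQA-Synthetic and 0.5K WQA-Real images as test sets, with the remainder used for training) can be sketched as a simple random hold-out. This is a minimal illustration, not the authors' released code; the function name, fixed seed, and the use of integer IDs as stand-ins for images are assumptions.

```python
import random


def split_dataset(items, n_test, seed=0):
    """Randomly hold out n_test items as the test set; the rest form the
    training set. A sketch of the paper's split protocol; seed is illustrative."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# e.g., 50K synthetic watermarked image IDs -> 45K train / 5K test
train, test = split_dataset(range(50_000), n_test=5_000)
```

The same call with `n_test=500` would reproduce the 0.5K WQA-Real hold-out.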
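The cosine annealing schedule paired with Adam in all three stages can be written out explicitly. This is a minimal sketch of the standard cosine decay formula; the floor learning rate `lr_min` and per-step granularity are assumptions, since the paper only states the peak rates (1×10⁻⁴ for Stage-1, 2×10⁻⁵ for Stages 2–3).

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate: starts at lr_max, decays smoothly
    to lr_min over total_steps following half a cosine period."""
    cos_term = (1 + math.cos(math.pi * step / total_steps)) / 2
    return lr_min + (lr_max - lr_min) * cos_term


# Stage-1 peak rate: lr is 1e-4 at step 0, halves at the midpoint,
# and reaches lr_min at the final step.
lr_start = cosine_annealing_lr(0, 100, 1e-4)
lr_mid = cosine_annealing_lr(50, 100, 1e-4)
lr_end = cosine_annealing_lr(100, 100, 1e-4)
```

In practice a framework scheduler (e.g., a cosine-annealing scheduler wrapping an Adam optimizer) computes the same curve per step or per epoch.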