Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrate that our framework can be easily and effectively applied to various MLLMs, such as SPHINX-X and LLaVA. After training with MDVP-Instruct-Data and image-level instruction datasets, our models exhibit impressive multimodal interaction capabilities and pixel-level understanding, while maintaining their image-level visual perception performance. The code and related resources are available at https://draw-and-understand.github.io. To comprehensively evaluate the effectiveness of our VP-MLLMs, we first use LLaVA-Bench to compare with previous models and assess general image-level understanding capabilities. Additionally, we conducted evaluations on a series of modern MLLM benchmarks. As shown in Table 2, our VP-MLLMs exhibit leading performance across image-level benchmarks such as MME, SEED-Bench, POPE, and OCRBench compared to existing visual prompting methods. Furthermore, we use Ferret-Bench and our proposed MDVP-Bench to evaluate the pixel-level visual prompting capabilities of our models. Experiments are conducted on the Referring Description and Referring Reasoning tasks within Ferret-Bench, with the results presented in Table 3. For MDVP-Bench, we combine different tasks to provide a comprehensive assessment of the model's overall capabilities. To evaluate the effectiveness of the key elements of our approach, we conduct ablation experiments. Given the extensive amount of training data, our comparison was limited to the first 50k training iterations. |
| Researcher Affiliation | Academia | Weifeng Lin (1), Xinyu Wei (2), Ruichuan An (2), Peng Gao (4), Bocheng Zou (3), Yulin Luo (2), Siyuan Huang (4), Shanghang Zhang (2), Hongsheng Li (1). Affiliations: 1 CUHK; 2 Peking University; 3 University of Wisconsin-Madison; 4 Shanghai AI Laboratory |
| Pseudocode | No | The paper describes the architecture and training strategy but does not include any clearly labeled pseudocode or algorithm blocks. The methods are described in narrative text and illustrated with diagrams. |
| Open Source Code | Yes | The code and related resources are available at https://draw-and-understand.github.io. |
| Open Datasets | Yes | Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset featuring 1.2 million image-visual prompt-text triplets, including natural images, document images, scene text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to evaluate a model's ability to understand visual prompting instructions. The code and related resources are available at https://draw-and-understand.github.io. The public datasets we utilized include Flickr30K (Plummer et al., 2015), RefCOCO/+ (Yu et al., 2016), GCG (Rasheed et al., 2023), and GRIT (You et al., 2023), as well as GeoChat (Kuckreja et al., 2023) from the remote sensing domain. All datasets used in stage 1 are open-source detection and segmentation datasets. A complete list of the datasets can be found in Table 5. |
| Dataset Splits | Yes | Experiments were conducted on the validation sets of RefCOCOg. Following (Yuan et al., 2024a), we use the METEOR and CIDEr scores to evaluate the semantic similarity between the generated captions and the ground truth. To evaluate the detailed region description capabilities, we leverage GPT-4 to measure the quality of responses for input referring regions. Following the approach used in Osprey (Yuan et al., 2024a), we sample 80 images from the RefCOCOg validation set (Yu et al., 2016) to generate detailed region captions using box visual prompts and text prompts like, "Please provide a detailed description of each marked region in the image." Following the evaluation approach employed in ViP-LLaVA (Liu et al., 2024a) and GPT4RoI (Zhang et al., 2023a), we fine-tuned our models using the training set of VCR. In all evaluation experiments, we will not continue to fine-tune on a specific dataset but will instead adopt a zero-shot testing approach. |
| Hardware Specification | Yes | Both Stage 1 and Stage 2 training were conducted on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer and flash attention for efficiency, and refers to base models like LLaMA2, LLaMA3, and CLIP-ViT-L-14. However, it does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We employ AdamW (Loshchilov & Hutter, 2017) as our optimizer and leverage flash attention (Ford et al., 2009) to enhance computational efficiency. During the stage 1 training phase, we set the starting learning rate to 4e-5. In stage 2, the initial learning rate was adjusted to 1e-5. The input images were processed using each model's unique dynamic resolution mechanism, and the maximum sequence length for the Large Language Model (LLM) was set to 3072. Table 7 (Hyperparameters of VP-MLLMs), Stage 1 / Stage 2: Batch Size 256 / 64; Training Epochs 1 / 1; Warmup Epochs 0.03 / 0.03; Learning Rate 4e-5 / 1e-5; LR Schedule cosine decay / cosine decay; Gradient Clipping 8 / 8; Weight Decay 0 / 0; Optimizer AdamW / AdamW. |
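The reported schedule (AdamW, cosine decay, 0.03 warmup epochs, stage-specific learning rates) can be expressed as a short Python sketch. This is a minimal illustration, not the authors' code: the total step count is a placeholder, and the linear warmup shape is an assumption, since the paper only reports the warmup length.

```python
import math

# Hyperparameters as reported in Table 7 of the paper.
HPARAMS = {
    "stage1": {"batch_size": 256, "lr": 4e-5},
    "stage2": {"batch_size": 64, "lr": 1e-5},
}

def lr_at(step: int, total_steps: int, base_lr: float, warmup_frac: float = 0.03) -> float:
    """Learning rate at a given step: linear warmup to base_lr (assumed shape),
    then cosine decay to zero, matching the 'cosine decay' schedule in Table 7."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Illustrative use with a placeholder horizon of 1000 steps:
peak = lr_at(30, 1000, HPARAMS["stage1"]["lr"])    # end of warmup -> 4e-5
final = lr_at(1000, 1000, HPARAMS["stage1"]["lr"])  # end of training -> ~0
```

In a real run, `total_steps` would be derived from the dataset size, batch size, and the single training epoch reported per stage.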