Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Authors: Xu Yuan, Li Zhou, Zenghui Sun, Zikun Zhou, Jinsong Lan
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple/empty segmentation, and reasoning segmentation. |
| Researcher Affiliation | Collaboration | 1The Hong Kong Polytechnic University, HK SAR 2TAO Technology, Alibaba Group, China 3Pengcheng Laboratory, China |
| Pseudocode | No | The paper describes methods in prose and diagrams (Figure 3, Figure 4) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/lizhou-cs/mglmm |
| Open Datasets | Yes | Observing the lack of a benchmark for model training and evaluation over the MGSC task, we establish a benchmark with aligned masks and captions in multi-granularity using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. Our training dataset is composed of six parts: (1) semantic segmentation: ADE20K (Zhou et al. 2019), COCO-Stuff (Caesar, Uijlings, and Ferrari 2018), Mapillary Vistas (Neuhold et al. 2017), PACO-LVIS (Ramanathan et al. 2023), and PASCAL-Part (Chen et al. 2014); (2) referring segmentation: RefCLEF (Jing et al. 2021) and the RefCOCO series (Yu et al. 2016); (3) COCO Caption (Chen et al. 2015) for image-level captioning; (4) LLaVA-150k (Liu et al. 2024b) for basic VQA ability; (5) GranDf (Rasheed et al. 2024) for grounded conversation generation; (6) our proposed MGSCData for multi-granularity SegCap. |
| Dataset Splits | Yes | Following the same settings, we finetune GLaMM and our MGLMM on the training set of MGSCData and evaluate them using the same metric. We achieve significantly leading performance over recent works like GLaMM and OMG-LLaVA on the RefCOCO/+/g validation and test sets. For reasoning segmentation, we utilize the validation set of the ReasonSeg dataset (Lai et al. 2024) as the benchmark. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, or memory) for running its experiments in the main text. |
| Software Dependencies | No | The paper mentions using specific models like Vicuna-7B, CLIP, and SAM, but does not provide specific version numbers for underlying software libraries, programming languages, or development environments (e.g., Python, PyTorch versions). |
| Experiment Setup | No | The paper states that the model is trained in a joint training setting, using CE loss for text generation and BCE and DICE loss for mask prediction, with Vicuna-7B as the LLM backbone. However, it defers 'Further implementation details' to Appendix D and does not provide specific hyperparameters such as learning rate, batch size, or number of epochs in the main text. |
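The mask-prediction objective noted in the last row (BCE plus Dice loss, with CE handling text generation) can be sketched as below. This is a minimal illustration only: the function names and the loss weights `w_bce` and `w_dice` are placeholder assumptions, since the paper defers its actual hyperparameters to Appendix D.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over flattened mask probabilities in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), computed on soft masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def mask_loss(pred, target, w_bce=2.0, w_dice=0.5):
    """Weighted BCE + Dice; the weights here are illustrative, not the paper's."""
    return w_bce * bce_loss(pred, target) + w_dice * dice_loss(pred, target)
```

A perfect prediction drives both terms toward zero, while Dice keeps the objective sensitive to small foreground masks where plain BCE is dominated by background pixels.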