Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Authors: Xu Yuan, Li Zhou, Zenghui Sun, Zikun Zhou, Jinsong Lan
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple/empty segmentation, and reasoning segmentation. |
| Researcher Affiliation | Collaboration | 1The Hong Kong Polytechnic University, HK SAR 2TAO Technology, Alibaba Group, China 3Pengcheng Laboratory, China |
| Pseudocode | No | The paper describes methods in prose and diagrams (Figure 3, Figure 4) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/lizhou-cs/mglmm |
| Open Datasets | Yes | Observing the lack of a benchmark for model training and evaluation over the MGSC task, we establish a benchmark with aligned masks and captions in multi-granularity using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. Our training dataset is composed of six parts: (1) semantic segmentation: ADE20K (Zhou et al. 2019), COCO-Stuff (Caesar, Uijlings, and Ferrari 2018), Mapillary Vistas (Neuhold et al. 2017), PACO-LVIS (Ramanathan et al. 2023), and PASCAL-Part (Chen et al. 2014); (2) referring segmentation: RefCLEF (Jing et al. 2021) and the RefCOCO series (Yu et al. 2016); (3) COCO Caption (Chen et al. 2015) for image-level captioning; (4) LLaVA-150k (Liu et al. 2024b) for basic VQA ability; (5) GranDf (Rasheed et al. 2024) for grounded conversation generation; (6) our proposed MGSCData for multi-granularity SegCap. |
| Dataset Splits | Yes | Following the same settings, we finetune GLaMM and our MGLMM on the training set of MGSCData and evaluate them using the same metric. We achieve significantly leading performance over recent works like GLaMM and OMG-LLaVA on the RefCOCO/+/g validation and test sets. For reasoning segmentation, we utilize the validation set of the ReasonSeg dataset (Lai et al. 2024) as the benchmark. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, or memory) for running its experiments in the main text. |
| Software Dependencies | No | The paper mentions using specific models like Vicuna-7B, CLIP, and SAM, but does not provide specific version numbers for underlying software libraries, programming languages, or development environments (e.g., Python, PyTorch versions). |
| Experiment Setup | No | The paper states that the model is trained in a joint training setting, using CE loss for text generation and BCE and DICE loss for mask prediction, with Vicuna-7B as the LLM backbone. However, it defers 'Further implementation details' to Appendix D and does not provide specific hyperparameters such as learning rate, batch size, or number of epochs in the main text. |
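The mask-prediction objective noted in the last row (BCE plus Dice loss, with CE handling text generation) can be sketched as below. This is a minimal illustration only: the function names and the loss weights `w_bce` and `w_dice` are placeholder assumptions, since the paper defers its actual hyperparameters to Appendix D.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over flattened mask probabilities in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), computed on soft masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def mask_loss(pred, target, w_bce=2.0, w_dice=0.5):
    """Weighted BCE + Dice; the weights here are illustrative, not the paper's."""
    return w_bce * bce_loss(pred, target) + w_dice * dice_loss(pred, target)
```

A perfect prediction drives both terms toward zero, while Dice keeps the objective sensitive to small foreground masks where plain BCE is dominated by background pixels.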