SegLLM: Multi-round Reasoning Segmentation with Large Language Models
Authors: Xudong Wang, Shaolun Zhang, Shufan Li, Kehan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, Trevor Darrell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization. Through extensive experiments, we demonstrate that SegLLM outperforms previous state-of-the-art models by 18-30% on our multi-round reasoning segmentation benchmarks, MRSeg. |
| Researcher Affiliation | Collaboration | UC Berkeley, UCLA, Panasonic AI Research, Stanford |
| Pseudocode | No | The paper describes methods and pipelines, including architectural diagrams and mathematical formulations, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described. |
| Open Datasets | Yes | We constructed our multi-round image reasoning segmentation dataset (MRSeg) based on several widely utilized datasets, and include data from the following sources: RefCOCO(+/g) (Yu et al., 2016; Kazemzadeh et al., 2014), Visual Genome (Krishna et al., 2017), PACO-LVIS (Ramanathan et al., 2023), LVIS (Gupta et al., 2019), Pascal Panoptic Part (de Geus et al., 2021), ADE20K (Zhou et al., 2017), COCO-Stuff (Caesar et al., 2016) and MSCOCO (Lin et al., 2014b). We use the following dataset licenses: COCO (Attribution-NonCommercial-ShareAlike 4.0 International), RefCOCO (Apache-2.0 license), Visual Genome (Creative Commons Attribution 4.0 International License), PACO (MIT License), Pascal-Panoptic-Parts (Apache-2.0 license), LVIS (CC BY 4.0 + COCO license). |
| Dataset Splits | Yes | We document the number of images sampled from each source dataset and the number of conversations generated in Table A1. Additionally, we visualize the distribution of the number of rounds for each dataset in Fig. A1. Table A1: Statistics of our MRSeg dataset, including the number of overall conversations, number of images, and the maximum rounds of conversations for each dataset after processing through our dataset pipeline. |
| Hardware Specification | Yes | We use NVIDIA A100 GPUs for model training. |
| Software Dependencies | Yes | We use a pretrained CLIP-ViT-Large (Radford et al., 2021) with a patch size of 14 as the image encoder, HIPIE-R50 (Wang et al., 2024b) as the mask encoder and LLaVA-v1.5-7B (Liu et al., 2024) as the base language model. Compared with LISA, which has exactly one mask per training sample, SegLLM's setup contains multiple masks per conversation. Hence, we replaced the SAM ViT-H mask decoder (Kirillov et al., 2023) with a smaller HIPIE-R50 (Wang et al., 2024b) to reduce the computation overhead during training. We then fine-tune the LLM model and the projector weights f_V2L using the training set of our own multi-round instruction-segmentation dataset MRSeg, while keeping the weights of the CLIP image encoder and the HIPIE mask decoder frozen. Furthermore, we utilize the stage-2 DeepSpeed accelerator (Rasley et al., 2020) and bf16 floating point precision to enhance training efficiency and reduce memory consumption. |
| Experiment Setup | Yes | We fine-tune our model with a total batch size of 16 (a per-device batch size of 2) using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 2e-5. Furthermore, we utilize the stage-2 DeepSpeed accelerator (Rasley et al., 2020) and bf16 floating point precision to enhance training efficiency and reduce memory consumption. |
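The hyperparameters quoted in the Software Dependencies and Experiment Setup rows can be collected into a minimal configuration sketch. This is an illustrative summary, not the authors' actual training code; all dictionary keys and module names here are assumptions made for readability.

```python
# Hedged sketch of the SegLLM fine-tuning setup reported above:
# AdamW at lr 2e-5, total batch size 16 with per-device batch size 2,
# stage-2 DeepSpeed (ZeRO), and bf16 precision. Names are illustrative.
training_config = {
    "optimizer": "AdamW",                # Loshchilov & Hutter, 2017
    "learning_rate": 2e-5,
    "total_batch_size": 16,
    "per_device_batch_size": 2,
    "precision": "bf16",                 # bfloat16 for memory efficiency
    "deepspeed_zero_stage": 2,           # stage-2 DeepSpeed accelerator
    # Fine-tuned vs. frozen modules, per the Software Dependencies row:
    "trainable": ["llm", "projector_f_V2L"],
    "frozen": ["clip_vit_large_image_encoder", "hipie_r50_mask_decoder"],
}

# The batch sizes imply gradient accumulation across 8 device-steps
# (total / per-device); the paper does not state the exact GPU count.
device_steps = (
    training_config["total_batch_size"]
    // training_config["per_device_batch_size"]
)
print(device_steps)  # → 8
```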