SegLLM: Multi-round Reasoning Segmentation with Large Language Models
Authors: Xudong Wang, Shaolun Zhang, Shufan Li, Kehan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, Trevor Darrell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization. Through extensive experiments, we demonstrate that SegLLM outperforms previous state-of-the-art models by 18-30% on our multi-round reasoning segmentation benchmarks, MRSeg. |
| Researcher Affiliation | Collaboration | UC Berkeley, UCLA, Panasonic AI Research, Stanford |
| Pseudocode | No | The paper describes methods and pipelines, including architectural diagrams and mathematical formulations, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described. |
| Open Datasets | Yes | We constructed our multi-round image reasoning segmentation dataset (MRSeg) based on several widely utilized datasets, and include data from the following sources: RefCOCO(+/g) (Yu et al., 2016; Kazemzadeh et al., 2014), Visual Genome (Krishna et al., 2017), PACO-LVIS (Ramanathan et al., 2023), LVIS (Gupta et al., 2019), Pascal Panoptic Part (de Geus et al., 2021), ADE20K (Zhou et al., 2017), COCO-Stuff (Caesar et al., 2016) and MSCOCO (Lin et al., 2014b). We use the following dataset licenses: COCO (Attribution-NonCommercial-ShareAlike 4.0 International), RefCOCO (Apache-2.0 license), Visual Genome (Creative Commons Attribution 4.0 International License), PACO (MIT License), Pascal-Panoptic-Parts (Apache-2.0 license), LVIS (CC BY 4.0 + COCO license). |
| Dataset Splits | Yes | We document the number of images sampled from each source dataset and the number of conversations generated in Table A1. Additionally, we visualize the distribution of the number of rounds for each dataset in Fig. A1. Table A1: Statistics of our MRSeg dataset, including the number of overall conversations, number of images, and the maximum rounds of conversations for each dataset after processing through our dataset pipeline. |
| Hardware Specification | Yes | We use NVIDIA A100 GPUs for model training. |
| Software Dependencies | Yes | We use a pretrained CLIP-ViT-Large (Radford et al., 2021) with a patch size of 14 as the image encoder, HIPIE-R50 (Wang et al., 2024b) as the mask encoder and LLaVA-v1.5-7B (Liu et al., 2024) as the base language model. Compared with LISA, which has exactly one mask per training sample, SegLLM's setup contains multiple masks per conversation. Hence, we replaced the SAM ViT-H mask decoder (Kirillov et al., 2023) with a smaller HIPIE-R50 (Wang et al., 2024b) to reduce the computation overhead during training. We then fine-tune the LLM model and the projector weights f_V2L using the training set of our own multi-round instruction-segmentation dataset MRSeg, while keeping the weights of the CLIP image encoder and the HIPIE mask decoder frozen. Furthermore, we utilize the stage-2 DeepSpeed accelerator (Rasley et al., 2020) and bf16 floating point precision to enhance training efficiency and reduce memory consumption. |
| Experiment Setup | Yes | We fine-tune our model with a total batch size of 16 (a per-device batch size of 2) using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 2e-5. Furthermore, we utilize the stage-2 DeepSpeed accelerator (Rasley et al., 2020) and bf16 floating point precision to enhance training efficiency and reduce memory consumption. |
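The hyperparameters quoted in the Software Dependencies and Experiment Setup rows can be collected into a minimal configuration sketch. This is an illustrative summary, not the authors' actual training code; all dictionary keys and module names here are assumptions made for readability.

```python
# Hedged sketch of the SegLLM fine-tuning setup reported above:
# AdamW at lr 2e-5, total batch size 16 with per-device batch size 2,
# stage-2 DeepSpeed (ZeRO), and bf16 precision. Names are illustrative.
training_config = {
    "optimizer": "AdamW",                # Loshchilov & Hutter, 2017
    "learning_rate": 2e-5,
    "total_batch_size": 16,
    "per_device_batch_size": 2,
    "precision": "bf16",                 # bfloat16 for memory efficiency
    "deepspeed_zero_stage": 2,           # stage-2 DeepSpeed accelerator
    # Fine-tuned vs. frozen modules, per the Software Dependencies row:
    "trainable": ["llm", "projector_f_V2L"],
    "frozen": ["clip_vit_large_image_encoder", "hipie_r50_mask_decoder"],
}

# The batch sizes imply gradient accumulation across 8 device-steps
# (total / per-device); the paper does not state the exact GPU count.
device_steps = (
    training_config["total_batch_size"]
    // training_config["per_device_batch_size"]
)
print(device_steps)  # → 8
```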