Text4Seg: Reimagining Image Segmentation as Text Generation

Authors: Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework." (Section 4, Experiments)
Researcher Affiliation: Collaboration. Mengcheng Lan, Chaofeng Chen, Yue Zhou (S-Lab, Nanyang Technological University); Jiaxing Xu, Yiping Ke (CCDS, Nanyang Technological University); Xinjiang Wang, Litong Feng, Wayne Zhang (SenseTime Research).
Pseudocode: No. The paper describes its methodology in text and uses diagrams (Fig. 1, Fig. 2, Fig. 3, Fig. 5) to illustrate concepts, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. https://github.com/mc-lan/Text4Seg
Open Datasets: Yes. "For referring expression segmentation (RES), we follow standard evaluation protocols (Lai et al., 2024; Xia et al., 2024) and assess our method using the refCOCO series. We construct the referring segmentation dataset by combining the train split of refCLEF, refCOCO, refCOCO+ (Kazemzadeh et al., 2014), and refCOCOg (Mao et al., 2016). For the open-vocabulary segmentation task, we utilize all three types of question-answer templates. Specifically, we construct our visual instruction data using the COCO-Stuff dataset. We evaluate the model's performance on ADE20K (A-150) (Zhou et al., 2019), PASCAL Context 59 (PC-59) (Mottaghi et al., 2014), and PASCAL VOC 20 (PAS-20) (Everingham, 2009) datasets, using mIoU as the evaluation metric."
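The quoted evaluation metric, mIoU, averages per-class intersection-over-union over the classes present in either map. A minimal sketch of that computation (not the paper's evaluation code; the toy arrays and class count are illustrative):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes that appear
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps: class 0 IoU = 1/2, class 1 IoU = 2/3, mean ≈ 0.583
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, num_classes=2))
```

Benchmark suites typically accumulate the intersection and union counts over the whole validation set before dividing, rather than averaging per-image IoUs as this toy example does for a single pair of maps.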
Dataset Splits: Yes. "We construct the referring segmentation dataset by combining the train split of refCLEF, refCOCO, refCOCO+, and refCOCOg, resulting in a dataset of 800k samples. Our model is trained on this dataset for 5 epochs. Additionally, to evaluate performance on a multi-object/non-object segmentation task, we construct a generalized referring expression segmentation dataset with 419k samples using the train split of grefCOCO (Liu et al., 2023a). We continue to fine-tune the model for 2 epochs. For a comprehensive comparison, we also report the performance of the LLaVA-1.5-7B model based on our implementation. Our method, Text4Seg, built upon stage-2 of LLaVA-1.5-7B, is trained on both the LLaVA-v1.5-mix665k dataset and our referring segmentation datasets. The ratio of open-vocabulary segmentation templates, partial segmentation templates, and conditioned segmentation templates is set to 1 : 3 : 6. To further enhance diversity, we apply random cropping to both the image and mask. By iterating 10 times over the COCO-Stuff train set, we ultimately generate a training dataset consisting of 1.16M samples."
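The quoted 1 : 3 : 6 template ratio can be realized by weighted sampling when each training sample is generated. A hedged sketch of that sampling step (the template-type names are placeholders, not the paper's actual prompts):

```python
import random

# Ratio 1 : 3 : 6 over the three question-answer template types, as quoted.
TEMPLATE_TYPES = ["open_vocabulary", "partial", "conditioned"]
WEIGHTS = [1, 3, 6]

def pick_template(rng):
    """Draw one template type with probability proportional to its weight."""
    return rng.choices(TEMPLATE_TYPES, weights=WEIGHTS, k=1)[0]

rng = random.Random(0)  # fixed seed for a reproducible demo
counts = {t: 0 for t in TEMPLATE_TYPES}
for _ in range(10_000):
    counts[pick_template(rng)] += 1
# Empirical proportions approach 0.1 / 0.3 / 0.6 as draws accumulate.
print(counts)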
Hardware Specification: Yes. "All models are trained on 8 Tesla A800 GPUs (40GB) with a global batch size of 128."
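A global batch size of 128 across 8 GPUs implies that the per-device micro-batch and any gradient-accumulation steps multiply back to 128. A quick sanity check (the micro-batch size here is an assumption for illustration; the paper does not state it):

```python
GLOBAL_BATCH = 128
NUM_GPUS = 8
PER_DEVICE_BATCH = 4  # hypothetical micro-batch per GPU, not from the paper

# Accumulation steps needed so that NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM
# equals the global batch size.
GRAD_ACCUM = GLOBAL_BATCH // (NUM_GPUS * PER_DEVICE_BATCH)
assert NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM == GLOBAL_BATCH
print(GRAD_ACCUM)  # 4 accumulation steps under these assumptions
```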
Software Dependencies: No. "Our method is implemented using SWIFT (Zhao et al., 2024)." The paper mentions SWIFT and specific MLLM backbones such as LLaVA-1.5, Qwen-VL, DeepSeek-VL, InternVL2, and SAM, but it does not provide version numbers for these components or for underlying libraries such as Python or PyTorch.
Experiment Setup: Yes. "All models are trained on 8 Tesla A800 GPUs (40GB) with a global batch size of 128. We use the AdamW optimizer (Loshchilov, 2017), starting with an initial learning rate of 2e-4, which follows a linear decay schedule after a warm-up phase with a ratio of 0.03. The weight decay is set to 0, and gradient norms are clipped at 1.0. To minimize GPU memory usage, we fine-tune all models using LoRA with a rank of 64, along with ZeRO-2 stage memory optimization." Table 7 (hyper-parameters and training settings for the RES task):
- Optimizer: AdamW
- Learning rate: 2e-4
- Weight decay: 0.0
- (β1, β2): (0.9, 0.95)
- Gradient norm clip: 1.0
- Scheduler: linear decay
- Warmup ratio: 0.03
- LoRA rank: 64
- LoRA alpha (α): 128
- LoRA dropout: 0.05
- Global batch size: 128
- Samples per epoch: 800k
- Total epochs: 5
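The quoted schedule (linear warm-up for 3% of training, then linear decay of the 2e-4 peak) can be written as a small pure-Python function. A minimal sketch, assuming decay ends at zero on the final step, which the paper does not state explicitly:

```python
def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warm-up to base_lr over warmup_ratio of training,
    then linear decay to 0 (matching the Table 7 settings)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Linear decay from base_lr at the end of warm-up to 0 at total_steps.
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000  # illustrative step count, not from the paper
print(lr_at_step(0, total))     # 0.0 at the first step
print(lr_at_step(30, total))    # peak 2e-4 at the end of warm-up
print(lr_at_step(total, total)) # back to 0.0 at the final step
```

In a real training loop, this function would typically be wrapped in a scheduler (e.g. a per-step learning-rate multiplier) alongside AdamW with the quoted betas, weight decay 0.0, and gradient-norm clipping at 1.0.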