Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation
Authors: Zhihua Liu, Amrutha Saseendran, Lei Tong, Xilin He, Fariba Yousefi, Nikolay Burlutskiy, Dino Oglic, Tom Diethe, Philip Alexander Teare, Huiyu Zhou, Chen Jin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field. |
| Researcher Affiliation | Collaboration | 1School of Computing and Mathematical Sciences, University of Leicester, UK 2Centre for AI, Data Science & Artificial Intelligence, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK 3Shenzhen University. Correspondence to: Huiyu Zhou <EMAIL>, Chen Jin <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Segment Anyword (pseudo code) |
| Open Source Code | Yes | Project page, code, and data are available at https://zhihualiued.github.io/segment_anyword |
| Open Datasets | Yes | We perform extensive experiments on six multi-modal image segmentation datasets, including the open-set language-grounded segmentation dataset GranDf (Rasheed et al., 2024), the multi-object referring image segmentation dataset gRefCOCO (Liu et al., 2023), the single-object referring image segmentation datasets RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al., 2014), and open-vocabulary semantic segmentation on Pascal Context (Mottaghi et al., 2014). |
| Dataset Splits | Yes | GranDf ... comprises 214K image-grounded text pairs, along with 2.5K validation samples and 5K test samples... RefCOCO ... divided into 120,624 training, 10,834 validation, 5,657 test A, and 5,095 test B samples. RefCOCO+ ... with 120,624 training, 10,758 validation, 5,726 test A, and 4,889 test B samples. RefCOCOg ... comprises 104,560 referring expressions for 54,822 objects across 26,711 images... gRefCOCO ... The validation set contains 1,485 images with 5,324 sentences, while test A includes 750 images with 8,825 sentences, and test B consists of 749 images with 5,744 sentences. PASCAL Context ... with 5,100 images in the validation set. |
| Hardware Specification | Yes | Our experiments were executed on a single 40G A100 GPU with a batch size of 8. ... All experiments were conducted on a single NVIDIA A100 40GB GPU. |
| Software Dependencies | Yes | We choose the fine-tuned version of Vicuna-7B-v1.5 (Zheng et al., 2023) as our large language model (LLM) to parse the text prompt and generate the noun phrases... For the post-processing module, we utilize a frozen SAM with ViT-H as the promptable mask generator. |
| Experiment Setup | Yes | The base learning rate for textual embedding was set to 0.005. The hyper-parameters of textual embedding updating remain the same in LDM and MCPL, with the temperature and scaling term (τ, γ) of (0.3, 0.00075). We use BERT (Devlin, 2018) to generate token embeddings. For words included in BERT's pre-trained vocabulary, we directly use their pre-trained embeddings. ... With the LoRA fine-tuned BERT text encoder, Segment Anyword-f achieves fast inference-time text-domain adaptation, decreasing textual embedding update steps from 1100 to 50... |
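The experiment-setup values quoted above can be collected into a single configuration sketch. This is only an illustrative summary of the reported hyperparameters, not the authors' released configuration; all key names here are hypothetical.

```python
# Hyperparameters quoted from the paper's experiment setup.
# Key names are illustrative, not from the released code.
config = {
    "base_lr_textual_embedding": 0.005,  # base learning rate for textual embedding
    "temperature_tau": 0.3,              # τ in the embedding-update objective
    "scaling_gamma": 0.00075,            # γ scaling term
    "embedding_update_steps": 1100,      # default textual embedding update steps
    "embedding_update_steps_lora": 50,   # with the LoRA fine-tuned BERT encoder
    "batch_size": 8,
    "text_encoder": "BERT",
    "llm_parser": "Vicuna-7B-v1.5",
    "mask_generator": "frozen SAM, ViT-H",
}

# The reported step reduction (1100 -> 50) is a 22x cut in update steps.
speedup = config["embedding_update_steps"] / config["embedding_update_steps_lora"]
print(f"LoRA adaptation reduces update steps by {speedup:.0f}x")
```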