Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
Authors: Niccolò Avogaro, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Filip Janicki, A. Cristiano I. Malossi, Konrad Schindler, Roy Assaf
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose Prompt Matcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation. |
| Researcher Affiliation | Collaboration | Niccolò Avogaro (IBM Research; ETH Zurich), Thomas Frick (IBM Research), Mattia Rigotti (IBM Research), Andrea Bartezzaghi (IBM Research), Filip M. Janicki (IBM Research), Cristiano Malossi (IBM Research), Konrad Schindler (ETH Zurich), Roy Assaf (IBM Research) |
| Pseudocode | Yes | Pseudocode detailing inner workings of the framework can be found in Appendix A.6 |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code (e.g., for Prompt Matcher) or a link to a code repository. It mentions using various third-party models and frameworks but not providing its own implementation. |
| Open Datasets | Yes | As a testbed for our experiments, we use the MESS dataset collection (Blumenstiel et al., 2023). It consists of 22 different segmentation datasets that span a wide variety of application domains and image characteristics. The datasets are grouped into five broad domains, General (6 datasets), Earth (5), Medical (4), Engineering (4) and Agriculture (3) as detailed in Table 6. The MESS dataset collection was deliberately designed as a challenging benchmark for open-vocabulary models and is an ideal choice for evaluating foundation models. |
| Dataset Splits | Yes | As visual prompts, we sample one single image of the target class from the dataset itself, together with its ground truth segmentation mask. The minimal setup with a single prompt image, respectively an elementary text prompt, is a challenging and particularly user-friendly scenario. Picking the prompt image from the same dataset corresponds to the realistic scenario where the user creates the prompt on images acquired in their application setting, with similar imaging conditions and class definitions as the test data. To minimise biases due to the choice of prompt image, we sample a different prompt image for each prediction. |
| Hardware Specification | Yes | The evaluations were run on a single A100 with 40GB of memory, which takes 14 hours for one complete run with the largest model (LISA-13B). |
| Software Dependencies | No | The paper mentions various models and backbones used (e.g., CLIP, SAM, LLaVA, DINOv2, AM-RADIO, PyTorch), but it does not specify explicit version numbers for these software libraries or frameworks, which are necessary for a reproducible dependency listing. |
| Experiment Setup | Yes | For VLMs with advanced language abilities, we embed the class name in the sentence "Segment all the instances of class class_name in the image". As visual prompts, we sample one single image of the target class from the dataset itself, together with its ground-truth segmentation mask. The minimal setup with a single prompt image, respectively an elementary text prompt, is a challenging and particularly user-friendly scenario. We also consciously refrain from any fine-tuning. For open-vocabulary segmentation models, we consider the fully supervised approach CAT-Seg (Cho et al., 2024), the state of the art on the MESS dataset, and the training-free approach NACLIP (Hajimiri et al., 2024). In particular, we use CAT-Seg with the CLIP ViT-L/14 backbone, and NACLIP with the standard CLIP ViT-B configuration. We also include SEEM (Zou et al., 2023), specifically the SEEM DaViT-Large implementation. This is the only available model to accept TPs and VPs simultaneously, although in this section we only use them separately. Combined prompting with SEEM is discussed in Section 5. As VLM baselines, we include the decoder-free Florence-2 (Xiao et al., 2023), specifically the segmentation branch of the large, fine-tuned model, where we clip the generated sequence length to 1024 for computational reasons; and PALI-Gemma (Beyer et al., 2024), a small but effective architecture that uses a VQ-VAE decoder (van den Oord et al., 2018). We use the standard 224-mix implementation. We also evaluate the recent LISA (Lai et al., 2024), in particular the LISA-13B-llama2-v1 version. LISA integrates a multi-modal large language model (LLaVA (Liu et al., 2023b)) with a CLIP vision backbone and SAM. The model introduces a special `<SEG>` token to the LLM's vocabulary, employing an embedding-as-mask paradigm, where the hidden state corresponding to the `<SEG>` token is used by a fine-tuned SAM mask decoder to generate segmentation masks. To keep the evaluation focused, and taking into account computational resource limitations, we regard LISA as a proxy for its descendants GLaMM (Rasheed et al., 2024) and SESAME (Wu et al., 2023), which might offer marginal improvements. Our choice of VLMs is primarily informed by their performance on referring segmentation on the RefCOCO, RefCOCO+, and RefCOCOg datasets (Kazemzadeh et al., 2014; Mao et al., 2016), a task closely related to our FPSS task. In all cases, we opt for greedy LLM decoding. |
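The one-shot prompting protocol quoted in the table can be summarized in a few lines: a fixed text template carrying the class name, and a visual prompt re-sampled from the same dataset for every prediction to reduce prompt-image bias. The sketch below illustrates that protocol only; the function names and the toy dataset structure are illustrative assumptions, not code from the paper.

```python
import random

def make_text_prompt(class_name: str) -> str:
    # Text-prompt template quoted in the paper's experiment setup.
    return f"Segment all the instances of class {class_name} in the image."

def sample_visual_prompt(dataset, target_class, rng=random):
    # Pick one (image, mask) example of the target class from the same
    # dataset; called anew for each prediction so the prompt image varies.
    candidates = [ex for ex in dataset if ex["class"] == target_class]
    return rng.choice(candidates)

# Toy stand-in dataset; real entries would hold images and segmentation masks.
dataset = [
    {"class": "road", "image": "img0", "mask": "m0"},
    {"class": "tree", "image": "img1", "mask": "m1"},
]

print(make_text_prompt("tree"))
visual_prompt = sample_visual_prompt(dataset, "tree")
```

With only one candidate per class in this toy dataset the sampling is deterministic; on a real dataset each prediction would draw a different prompt image of the class.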