Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
Authors: Niccolò Avogaro, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Filip Janicki, A. Cristiano I. Malossi, Konrad Schindler, Roy Assaf
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose Prompt Matcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation. |
| Researcher Affiliation | Collaboration | Niccolò Avogaro (IBM Research; ETH Zurich), Thomas Frick (IBM Research), Mattia Rigotti (IBM Research), Andrea Bartezzaghi (IBM Research), Filip M. Janicki (IBM Research), Cristiano Malossi (IBM Research), Konrad Schindler (ETH Zurich), Roy Assaf (IBM Research) |
| Pseudocode | Yes | Pseudocode detailing inner workings of the framework can be found in Appendix A.6 |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code (e.g., for Prompt Matcher) or a link to a code repository. It mentions using various third-party models and frameworks but not providing its own implementation. |
| Open Datasets | Yes | As a testbed for our experiments, we use the MESS dataset collection (Blumenstiel et al., 2023). It consists of 22 different segmentation datasets that span a wide variety of application domains and image characteristics. The datasets are grouped into five broad domains, General (6 datasets), Earth (5), Medical (4), Engineering (4) and Agriculture (3) as detailed in Table 6. The MESS dataset collection was deliberately designed as a challenging benchmark for open-vocabulary models and is an ideal choice for evaluating foundation models. |
| Dataset Splits | Yes | As visual prompts, we sample one single image of the target class from the dataset itself, together with its ground truth segmentation mask. The minimal setup with a single prompt image, respectively an elementary text prompt, is a challenging and particularly user-friendly scenario. Picking the prompt image from the same dataset corresponds to the realistic scenario where the user creates the prompt on images acquired in their application setting, with similar imaging conditions and class definitions as the test data. To minimise biases due to the choice of prompt image, we sample a different prompt image for each prediction. |
| Hardware Specification | Yes | The evaluations were run on a single A100 with 40GB of memory, which takes 14 hours for one complete run with the largest model (LISA-13B). |
| Software Dependencies | No | The paper mentions various models and backbones used (e.g., CLIP, SAM, LLaVA, DINOv2, AM-RADIO, PyTorch), but it does not specify explicit version numbers for these software libraries or frameworks, which are necessary for a reproducible dependency listing. |
| Experiment Setup | Yes | For VLMs with advanced language abilities, we embed the class name in the sentence "Segment all the instances of class class_name in the image". As visual prompts, we sample one single image of the target class from the dataset itself, together with its ground-truth segmentation mask. The minimal setup with a single prompt image, respectively an elementary text prompt, is a challenging and particularly user-friendly scenario. We also consciously refrain from any fine-tuning. For open-vocabulary segmentation models, we consider the fully supervised approach CAT-Seg (Cho et al., 2024), the state of the art on the MESS dataset, and the training-free approach NACLIP (Hajimiri et al., 2024). In particular, we use CAT-Seg with the CLIP ViT-L/14 backbone, and NACLIP with the standard CLIP ViT-B configuration. We also include SEEM (Zou et al., 2023), specifically the SEEM DaViT-Large implementation. This is the only available model to accept TPs and VPs simultaneously, although in this section we only use them separately. Combined prompting with SEEM is discussed in Section 5. As VLM baselines, we include the decoder-free Florence-2 (Xiao et al., 2023), specifically the segmentation branch of the large, fine-tuned model, where we clip the generated sequence length to 1024 for computational reasons; and PALI-Gemma (Beyer et al., 2024), a small but effective architecture that uses a VQ-VAE decoder (van den Oord et al., 2018). We use the standard 224-mix implementation. We also evaluate the recent LISA (Lai et al., 2024), in particular the LISA-13B-llama2-v1 version. LISA integrates a multi-modal large language model (LLaVA (Liu et al., 2023b)) with a CLIP vision backbone and SAM. The model introduces a special `<SEG>` token to the LLM's vocabulary, employing an embedding-as-mask paradigm, where the hidden state corresponding to the `<SEG>` token is used by a fine-tuned SAM mask decoder to generate segmentation masks. To keep the evaluation focused, and taking into account computational resource limitations, we regard LISA as a proxy for its descendants GLaMM (Rasheed et al., 2024) and SESAME (Wu et al., 2023), which might offer marginal improvements. Our choice of VLMs is primarily informed by their performance on referring segmentation on the RefCOCO, RefCOCO+, and RefCOCOg datasets (Kazemzadeh et al., 2014; Mao et al., 2016), a task closely related to our FPSS task. In all cases, we opt for greedy LLM decoding. |
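The one-shot prompting protocol quoted in the table can be summarized in a few lines: a fixed text template carrying the class name, and a visual prompt re-sampled from the same dataset for every prediction to reduce prompt-image bias. The sketch below illustrates that protocol only; the function names and the toy dataset structure are illustrative assumptions, not code from the paper.

```python
import random

def make_text_prompt(class_name: str) -> str:
    # Text-prompt template quoted in the paper's experiment setup.
    return f"Segment all the instances of class {class_name} in the image."

def sample_visual_prompt(dataset, target_class, rng=random):
    # Pick one (image, mask) example of the target class from the same
    # dataset; called anew for each prediction so the prompt image varies.
    candidates = [ex for ex in dataset if ex["class"] == target_class]
    return rng.choice(candidates)

# Toy stand-in dataset; real entries would hold images and segmentation masks.
dataset = [
    {"class": "road", "image": "img0", "mask": "m0"},
    {"class": "tree", "image": "img1", "mask": "m1"},
]

print(make_text_prompt("tree"))
visual_prompt = sample_visual_prompt(dataset, "tree")
```

With only one candidate per class in this toy dataset the sampling is deterministic; on a real dataset each prediction would draw a different prompt image of the class.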