Test-time Contrastive Concepts for Open-world Semantic Segmentation with Vision-Language Models
Authors: Monika Wysoczańska, Antonin Vobecky, Amaia Cardiel, Tomasz Trzcinski, Renaud Marlet, Andrei Bursuc, Oriane Siméoni
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct our experiments on six datasets widely used for the task of zero-shot semantic segmentation Cha et al. (2023): fully-annotated COCO-Stuff Caesar et al. (2018), Cityscapes Cordts et al. (2016) and ADE20K Zhou et al. (2019), and object-centric VOC Everingham et al. (2012), COCO-Object Caesar et al. (2018) and Context Mottaghi et al. (2014)... We compare results when using different CCs proposed in this work. We also include results when having access to privileged information (CCPI)... Table 1: Benefits of CC measured in IoU-single. Table 2: mIoU results. Table 3: Ablation studies. |
| Researcher Affiliation | Collaboration | 1Warsaw University of Technology 2valeo.ai 3CIIRC CTU Prague 4FEE CTU Prague 5Tooploox 6LIGM, École des Ponts et Chaussées, IP Paris, CNRS, France 7Université Grenoble Alpes |
| Pseudocode | Yes | We present a pseudo-code of our metric in Algorithm 1. |
| Open Source Code | No | The paper mentions using "MMSegmentation implementation Contributors (2020)", "Detectron Wu et al. (2019)", and "Mixtral-8x7B-Instruct model Jiang et al. (2024)" via the "Hugging Face transformers library". These are third-party tools. The paper does not contain an explicit statement or link to the authors' own source code for the methodology described. |
| Open Datasets | Yes | We conduct our experiments on six datasets widely used for the task of zero-shot semantic segmentation Cha et al. (2023): fully-annotated COCO-Stuff Caesar et al. (2018), Cityscapes Cordts et al. (2016) and ADE20K Zhou et al. (2019), and object-centric VOC Everingham et al. (2012), COCO-Object Caesar et al. (2018) and Context Mottaghi et al. (2014)... For CCD generation, we use the statistics gathered by Udandarao et al. (2024) for four thousand common concepts in the LAION-400M dataset, which is a subset of LAION-2B Schuhmann et al. (2022) and which is used to train CLIP Radford et al. (2021). |
| Dataset Splits | Yes | We treat the input images following the protocol of Cha et al. (2023), which we detail in Appendix A... We conduct our experiments on six datasets widely used for the task of zero-shot semantic segmentation. |
| Hardware Specification | Yes | This work was supported by the National Centre of Science (Poland) Grant No. 2022/45/B/ST6/02817 and by the grant from NVIDIA providing one RTX A5000 24GB used for this project... computed on a machine equipped with Intel(R) i7 CPU and a Nvidia RTX A5000 GPU |
| Software Dependencies | Yes | We use the recent Mixtral-8x7B-Instruct model Jiang et al. (2024)... More precisely, we rely on the v0.1 version of its open weights available via the Hugging Face transformers library. We run the LLM in 4-bit precision with flash attention to speed up inference. |
| Experiment Setup | Yes | We use MMSegmentation implementation Contributors (2020) with a sliding window strategy and resize input images to have a shorter side of 448. In the case of CAT-Seg, we retain the original model framework and integrate IoU-single into Detectron Wu et al. (2019)... For CCD generation... We filter contrastive concepts using a low co-occurrence threshold γ = 0.01 and a high CLIP similarity threshold δ = 0.8. In the classic mIoU scenario, we use a threshold β = 0.9... We discuss the selection of these values in Appendix C.1. |
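The quoted setup describes filtering candidate contrastive concepts with a co-occurrence threshold γ = 0.01 and a CLIP text-similarity threshold δ = 0.8. A minimal sketch of that filtering step is below; the function name, data shapes, and the direction of each comparison (keep candidates co-occurring at least γ with the query, drop candidates more similar than δ to it) are our assumptions for illustration, not the authors' released code.

```python
def filter_contrastive_concepts(query, candidates, cooccurrence, similarity,
                                gamma=0.01, delta=0.8):
    """Hypothetical sketch of contrastive-concept (CC) filtering.

    candidates: list of concept names.
    cooccurrence[c]: co-occurrence rate of c with `query` in captions
        (e.g. from LAION-400M statistics).
    similarity[c]: CLIP text-embedding cosine similarity between c and `query`.
    """
    kept = []
    for c in candidates:
        if cooccurrence.get(c, 0.0) < gamma:
            continue  # too rarely seen with the query to describe its context
        if similarity.get(c, 0.0) > delta:
            continue  # near-synonym of the query; would suppress the class itself
        kept.append(c)
    return kept

# Toy example with made-up statistics for the query "dog"
cooc = {"grass": 0.12, "leash": 0.05, "puppy": 0.20, "submarine": 0.001}
sim = {"grass": 0.31, "leash": 0.45, "puppy": 0.92, "submarine": 0.22}
cc = filter_contrastive_concepts("dog", list(cooc), cooc, sim)
# "puppy" is dropped (similarity 0.92 > 0.8) and "submarine" is dropped
# (co-occurrence 0.001 < 0.01); "grass" and "leash" remain.
```

The two thresholds play complementary roles: γ discards concepts with too little evidence of appearing around the query, while δ discards concepts so close in CLIP text space that using them as contrast would erode the query class itself.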