Unveiling the Knowledge of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
Authors: Yajie Liu, Guodong Wang, Jinjin Zhang, Qingjie Liu, Di Huang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on 9 segmentation benchmarks with various CLIP models demonstrate that CLIPSeg consistently outperforms all training-free methods by substantial margins, e.g., a 7.8% improvement in average mIoU for CLIP with a ViT-L backbone, and competes with learning-based counterparts in generalizing to novel concepts in an efficient way. The paper also reports ablation studies: 'In this section, we conduct ablation studies to evaluate the effects of core components of the proposed method.' |
| Researcher Affiliation | Academia | 1 State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China; 2 School of Computer Science and Engineering, Beihang University, Beijing 100191, China |
| Pseudocode | No | The paper describes the Coherence enhanced Residual Attention (CRA) and Deep Semantic Integration (DSI) modules using mathematical formulas and descriptive text (e.g., Eq. 3, Eq. 4) but does not include any formal pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository. It mentions generating results for other methods using 'their official released code', but not for its own proposed method. |
| Open Datasets | Yes | We employ the COCO-Stuff dataset to evaluate the intra-image feature coherence. We follow the widely-used evaluation protocol, as introduced in TCL (Cha, Mun, and Roh 2023) to evaluate our method across 9 segmentation benchmarks in a zero-shot manner. These benchmarks are categorized into two groups: (i) without background class including Pascal VOC20 (Everingham et al. 2010) with 20 classes (denoted as V20), Pascal Context (Mottaghi et al. 2014) with 459 classes in the full version (C459) and the most frequent 59 classes in the C59 version, COCO-Stuff (Caesar, Uijlings, and Ferrari 2018) with 171 classes (STUFF), ADE20k (Zhou et al. 2019) with 847 classes in the full version (A847) and A150 version with the most frequent 150 classes and Cityscapes (CITY) (Cordts et al. 2016) with 19 classes. (ii) with a background class including Pascal Context 60 (C60) and COCO object with 80 classes (COCO). |
| Dataset Splits | Yes | We follow the widely-used evaluation protocol, as introduced in TCL (Cha, Mun, and Roh 2023) to evaluate our method across 9 segmentation benchmarks in a zero-shot manner. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions using various CLIP models with ViT-B and ViT-L backbones, which are model architectures, not hardware. |
| Software Dependencies | No | The paper mentions applying the framework to '8 widely-used CLIP models, namely CLIP, OpenCLIP, MetaCLIP, with both ViT-B (denoted as -B) and ViT-L (-L) backbones' but does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers. |
| Experiment Setup | Yes | The cosine distance threshold ϵ is empirically set to 0.75. The temperature in Eq. 3 is set to 6 across all models and datasets. If not specified, we ensemble the dense output of last three layers to construct V. |
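The setup row above states that the dense outputs of the last three layers are ensembled to construct V. The paper excerpt does not say how the ensemble is formed, so the sketch below assumes simple averaging over the per-patch feature maps of the last `k` transformer layers; the function name, shapes, and averaging choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ensemble_dense_outputs(layer_feats, k=3):
    """Ensemble the dense (per-patch) outputs of the last k transformer
    layers to form a value matrix V.

    Assumption: 'ensemble' is taken to mean element-wise averaging;
    the paper excerpt does not specify the exact operation.

    layer_feats: list of arrays, each of shape (num_patches, dim),
    ordered from the first layer to the last.
    """
    if len(layer_feats) < k:
        raise ValueError(f"need at least {k} layers, got {len(layer_feats)}")
    stacked = np.stack(layer_feats[-k:], axis=0)  # (k, num_patches, dim)
    return stacked.mean(axis=0)                   # (num_patches, dim)

# Hypothetical example: a 12-layer ViT-B with 196 patches and 768-dim features.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((196, 768)) for _ in range(12)]
V = ensemble_dense_outputs(feats, k=3)
print(V.shape)  # (196, 768)
```

With k=1 this reduces to using only the final layer's dense output, which makes the "last three layers" choice easy to ablate against.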