Unveiling the Knowledge of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

Authors: Yajie Liu, Guodong Wang, Jinjin Zhang, Qingjie Liu, Di Huang

AAAI 2025

Reproducibility assessment (variable, result, LLM response):
Research Type: Experimental. Experiments conducted on 9 segmentation benchmarks with various CLIP models demonstrate that CLIPSeg consistently outperforms all training-free methods by substantial margins, e.g., a 7.8% improvement in average mIoU for CLIP with a ViT-L backbone, and competes with learning-based counterparts in generalizing to novel concepts in an efficient way. Ablation studies are also conducted to evaluate the effects of the core components of the proposed method.
Researcher Affiliation: Academia. 1) State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China; 2) School of Computer Science and Engineering, Beihang University, Beijing 100191, China.
Pseudocode: No. The paper describes the Coherence-enhanced Residual Attention (CRA) and Deep Semantic Integration (DSI) modules using mathematical formulas and descriptive text (e.g., Eq. 3, Eq. 4) but does not include any formal pseudocode blocks or algorithms.
Open Source Code: No. The paper does not contain an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository. It mentions generating results for other methods using 'their official released code', but not for its own proposed method.
Open Datasets: Yes. We employ the COCO-Stuff dataset to evaluate the intra-image feature coherence. We follow the widely-used evaluation protocol, as introduced in TCL (Cha, Mun, and Roh 2023), to evaluate our method across 9 segmentation benchmarks in a zero-shot manner. These benchmarks are categorized into two groups: (i) without a background class, including Pascal VOC20 (Everingham et al. 2010) with 20 classes (denoted as V20), Pascal Context (Mottaghi et al. 2014) with 459 classes in the full version (C459) and the most frequent 59 classes in the C59 version, COCO-Stuff (Caesar, Uijlings, and Ferrari 2018) with 171 classes (STUFF), ADE20k (Zhou et al. 2019) with 847 classes in the full version (A847) and the A150 version with the most frequent 150 classes, and Cityscapes (CITY) (Cordts et al. 2016) with 19 classes; (ii) with a background class, including Pascal Context 60 (C60) and COCO object with 80 classes (COCO).
Dataset Splits: Yes. We follow the widely-used evaluation protocol, as introduced in TCL (Cha, Mun, and Roh 2023), to evaluate our method across 9 segmentation benchmarks in a zero-shot manner.
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions using various CLIP models with ViT-B and ViT-L backbones, which are model architectures, not hardware.
Software Dependencies: No. The paper mentions applying the framework to '8 widely-used CLIP models, namely CLIP, OpenCLIP, MetaCLIP, with both ViT-B (denoted as -B) and ViT-L (-L) backbones' but does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers.
Experiment Setup: Yes. The cosine distance threshold ε is empirically set to 0.75. The temperature in Eq. 3 is set to 6 across all models and datasets. If not specified, we ensemble the dense outputs of the last three layers to construct V.
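The reported hyperparameters (ε = 0.75, temperature = 6, ensembling the last three layers) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names are hypothetical, and the assumption that the cosine-distance threshold gates attention weights is our reading of the described setup, not a reproduction of Eq. 3.

```python
import numpy as np

def coherence_masked_attention(q, k, v, temperature=6.0, eps=0.75):
    """Hedged sketch: temperature-scaled attention (tau = 6, as reported)
    where token pairs whose cosine distance exceeds eps = 0.75 are masked.
    How the threshold is actually applied in the paper is an assumption here."""
    # Cosine similarity between query and key tokens.
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    cos_sim = qn @ kn.T
    # Keep only pairs whose cosine distance (1 - similarity) is within eps.
    keep = (1.0 - cos_sim) <= eps
    logits = np.where(keep, temperature * cos_sim, -np.inf)
    # Numerically stable softmax over keys; masked entries contribute zero.
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ensemble_last_layers(layer_outputs, n=3):
    """Average the dense outputs of the last n layers to construct V,
    following the 'last three layers' setting reported in the paper."""
    return np.mean(np.stack(layer_outputs[-n:]), axis=0)
```

When used self-attention-style (q = k), the diagonal always survives the mask (a token's cosine distance to itself is 0), so every row of the softmax stays well defined.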