kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Authors: Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of kNN-CLIP on state-of-the-art open-vocabulary semantic and panoptic segmentation benchmarks with the FC-CLIP (Yu et al., 2023) model. kNN-CLIP achieves notable performance increases (mIoU) across various challenging datasets: A-847, PC-459, and A-150. Specifically, we see improvements of +2.6, +1.7, and +7.2 points respectively, demonstrating effective segmentation across a continually growing vocabulary space.
Researcher Affiliation | Academia | Zhongrui Gui1, Shuyang Sun1, Runjia Li1, Jianhao Yuan1, Zhaochong An2, Karsten Roth3, Ameya Prabhu1,3, Philip Torr1 (1University of Oxford, 2University of Copenhagen, 3University of Tübingen)
Pseudocode | No | The paper describes the method using prose and mathematical equations but does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that the code will be released.
Open Datasets | Yes | Our study extends training-free continual vocabulary expansion with kNN-CLIP to semantic segmentation, testing its efficacy across dense prediction tasks. We analyze the performance of kNN-CLIP over five diverse datasets: ADE20K and A-847 (Zhou et al., 2019), containing 27K images, Pascal Context (PC)-59/459 (Mottaghi et al., 2014), and Pascal VOC-21 (Everingham et al., 2010), containing 10K images.
Dataset Splits | Yes | Our database is created by extracting features from each dataset's training set using DINOv2 (Oquab et al., 2023)...
Hardware Specification | Yes | All results were obtained using a single A40 GPU with CUDA 11.8 and PyTorch 2.0.0.
Software Dependencies | Yes | All results were obtained using a single A40 GPU with CUDA 11.8 and PyTorch 2.0.0.
Experiment Setup | Yes | Our database is created by extracting features from each dataset's training set using DINOv2 (Oquab et al., 2023) with a ViT-Giant architecture, featuring 4 register tokens (Darcet et al., 2023). We select "keys" as our feature representation. We resize the images to 518×518 and acquire an image feature of dimensions 1536×37×37, as the patch size of the chosen ViT is 14. We then conduct mask average pooling, shrinking the feature map to a single 1536-dimensional vector. Then, we store the pooled feature of dimension 1536 and its corresponding label c into the database using FAISS. On average, a stored feature occupies 6 kB of space. Note that we do not store past images alongside.
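The database-construction step quoted above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and a plain NumPy nearest-neighbour lookup stands in for the FAISS index the paper uses. The shapes follow the quoted setup (37×37 patch grid of 1536-dimensional DINOv2 features from a 518×518 input with patch size 14).

```python
import numpy as np

FEAT_DIM = 1536   # DINOv2 ViT-Giant feature dimension
GRID = 37         # 518 / 14 = 37 patches per side

# Each stored entry is one pooled float32 vector plus its class label.
# 1536 floats x 4 bytes ~= 6 kB, matching the storage figure quoted above.
database_feats = []
database_labels = []

def mask_average_pool(patch_feats, mask):
    """patch_feats: (D, H, W) patch features; mask: (H, W) boolean segment mask.
    Averages the features over the masked patches into a single (D,) vector."""
    d = patch_feats.shape[0]
    flat = patch_feats.reshape(d, -1)              # (D, H*W)
    m = mask.reshape(-1).astype(np.float32)        # (H*W,)
    return flat @ m / max(m.sum(), 1.0)            # (D,)

def add_entry(patch_feats, mask, label):
    """Pool one segment's features and store them with its label."""
    pooled = mask_average_pool(patch_feats, mask).astype(np.float32)
    database_feats.append(pooled)
    database_labels.append(label)

def knn_label(query, k=1):
    """Brute-force k-nearest-neighbour lookup over the stored features
    (the paper delegates this search to a FAISS index instead)."""
    feats = np.stack(database_feats)               # (N, D)
    dists = np.linalg.norm(feats - query, axis=1)  # (N,)
    nearest = np.argsort(dists)[:k]
    return [database_labels[i] for i in nearest]
```

In this sketch, per-image inputs would come from a DINOv2 forward pass reshaped to `(1536, 37, 37)`; only the pooled vectors and labels are kept, consistent with the note that past images are not stored.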