kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Authors: Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of kNN-CLIP on state-of-the-art open-vocabulary semantic and panoptic segmentation benchmarks with the FC-CLIP (Yu et al., 2023) model. kNN-CLIP achieves notable performance increases (mIoU) across various challenging datasets: A-847, PC-459, and A-150. Specifically, we see improvements of +2.6, +1.7, and +7.2 points respectively, demonstrating effective segmentation across a continually growing vocabulary space.
Researcher Affiliation | Academia | Zhongrui Gui1, Shuyang Sun1, Runjia Li1, Jianhao Yuan1, Zhaochong An2, Karsten Roth3, Ameya Prabhu1,3, Philip Torr1 (1University of Oxford, 2University of Copenhagen, 3University of Tübingen)
Pseudocode | No | The paper describes the method using prose and mathematical equations but does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that the code will be released.
Open Datasets | Yes | Our study extends training-free continual vocabulary expansion with kNN-CLIP to semantic segmentation, testing its efficacy across dense prediction tasks. We analyze the performance of kNN-CLIP over five diverse datasets: ADE20K and A-847 (Zhou et al., 2019), containing 27K images, Pascal Context (PC)-59/459 (Mottaghi et al., 2014), and Pascal VOC-21 (Everingham et al., 2010), containing 10K images.
Dataset Splits | Yes | Our database is created by extracting features from each dataset's training set using DINOv2 (Oquab et al., 2023)...
Hardware Specification | Yes | All results were obtained using a single A40 GPU with CUDA 11.8 and PyTorch 2.0.0.
Software Dependencies | Yes | All results were obtained using a single A40 GPU with CUDA 11.8 and PyTorch 2.0.0.
Experiment Setup | Yes | Our database is created by extracting features from each dataset's training set using DINOv2 (Oquab et al., 2023) with a ViT-Giant architecture, featuring 4 register tokens (Darcet et al., 2023). We select "keys" as our feature representation. We resize the images to 518×518 and acquire an image feature of dimensions 1536×37×37, as the patch size of the chosen ViT is 14. We then conduct mask average pooling, shrinking the feature map to a single 1536-dimensional vector. Then, we store the pooled feature of dimension 1536 and its corresponding label c into the database using FAISS. On average, a stored feature occupies 6 kB of space. Note that we do not store past images alongside.
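The database-construction step quoted above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and a plain NumPy nearest-neighbour lookup stands in for the FAISS index the paper uses. The shapes follow the quoted setup (37×37 patch grid of 1536-dimensional DINOv2 features from a 518×518 input with patch size 14).

```python
import numpy as np

FEAT_DIM = 1536   # DINOv2 ViT-Giant feature dimension
GRID = 37         # 518 / 14 = 37 patches per side

# Each stored entry is one pooled float32 vector plus its class label.
# 1536 floats x 4 bytes ~= 6 kB, matching the storage figure quoted above.
database_feats = []
database_labels = []

def mask_average_pool(patch_feats, mask):
    """patch_feats: (D, H, W) patch features; mask: (H, W) boolean segment mask.
    Averages the features over the masked patches into a single (D,) vector."""
    d = patch_feats.shape[0]
    flat = patch_feats.reshape(d, -1)              # (D, H*W)
    m = mask.reshape(-1).astype(np.float32)        # (H*W,)
    return flat @ m / max(m.sum(), 1.0)            # (D,)

def add_entry(patch_feats, mask, label):
    """Pool one segment's features and store them with its label."""
    pooled = mask_average_pool(patch_feats, mask).astype(np.float32)
    database_feats.append(pooled)
    database_labels.append(label)

def knn_label(query, k=1):
    """Brute-force k-nearest-neighbour lookup over the stored features
    (the paper delegates this search to a FAISS index instead)."""
    feats = np.stack(database_feats)               # (N, D)
    dists = np.linalg.norm(feats - query, axis=1)  # (N,)
    nearest = np.argsort(dists)[:k]
    return [database_labels[i] for i in nearest]
```

In this sketch, per-image inputs would come from a DINOv2 forward pass reshaped to `(1536, 37, 37)`; only the pooled vectors and labels are kept, consistent with the note that past images are not stored.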