Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Authors: Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of our method is experimentally validated on open-vocabulary dense prediction tasks, including object detection and image segmentation. With only a few epochs of finetuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements when integrated with recent RLA methods such as CLIPSelf (Wu et al., 2023b) and RegionCLIP (Zhong et al., 2022) for object detection tasks. For the segmentation benchmarks, our method also improves the performance of the recent state-of-the-art model CAT-Seg (Cho et al., 2023).
Researcher Affiliation | Academia | (1) School of Software Engineering, Xi'an Jiaotong University, China; (2) School of Computer and Communication Sciences, EPFL, Switzerland; (3) University of Chinese Academy of Sciences, Beijing, China
Pseudocode | No | The paper describes its methods through textual descriptions, mathematical equations (e.g., Equations 1-15), and visual diagrams (e.g., Figures 1, 2, 3, 4, 11), but does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code will be available at https://congpeiqiu.github.io/Refining
Open Datasets | Yes | With only a few epochs of finetuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements... Both stages are trained on the COCO train2017 dataset (Lin et al., 2014)... using COCO Panoptic dataset masks (Kirillov et al., 2019)... for unsupervised segmentation on Cityscapes (Cordts et al., 2016)... Trained on the ADE20K dataset (Zhou et al., 2017) and evaluated on ADE-847, ADE-150, and Pascal Context (Mottaghi et al., 2014)... COCO-Stuff (Caesar et al., 2018)... on CC3M (Sharma et al., 2018)
Dataset Splits | Yes | Both stages are trained on the COCO train2017 dataset (Lin et al., 2014)... We adopt F-ViT (Wu et al., 2023b) as the open-vocabulary object detector... The F-ViT model is trained for 3 epochs for the OV-COCO benchmark and 48 epochs for the OV-LVIS benchmark... By calculating the average CR value across COCO val2017
Hardware Specification | Yes | Concretely, we use 8 RTX 3090 GPUs for both stages with the AdamW (Loshchilov & Hutter, 2017) optimizer.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the MaskCLIP protocol, but does not provide specific version numbers for any software libraries or packages.
Experiment Setup | Yes | For the first stage, we set the learning rate to 1e-4 and train the Refiner for 4 epochs with a batch size of 16. For the second stage, we set the learning rate to 2e-5 and perform CLIP fine-tuning for 6 epochs with a batch size of 4... To optimize the Refiner, we generate C = 4 crops per image at scale ratios between [0.3, 0.7]. During the spatial correlation distillation stage, we set the temperatures τT = τS = 0.2, with λ = 0.2 for ViT-B and λ = 0.4 for ViT-L.
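The quoted setup above can be collected into a single configuration sketch. This is a reconstruction from the reported hyperparameters only, assuming a two-stage pipeline (Refiner pre-training, then CLIP fine-tuning); the dictionary keys and structure are illustrative and not taken from the authors' code.

```python
# Hedged sketch: hyperparameters reconstructed from the quoted experiment setup.
# Key names are hypothetical; values come from the text above.
refiner_stage = {
    "learning_rate": 1e-4,
    "epochs": 4,
    "batch_size": 16,
    "crops_per_image": 4,           # C = 4 crops per image
    "crop_scale_range": (0.3, 0.7), # scale ratios for generated crops
}

clip_finetune_stage = {
    "learning_rate": 2e-5,
    "epochs": 6,
    "batch_size": 4,
    "temperature_teacher": 0.2,     # tau_T in spatial correlation distillation
    "temperature_student": 0.2,     # tau_S
    "lambda_by_backbone": {"ViT-B": 0.2, "ViT-L": 0.4},
}

if __name__ == "__main__":
    print("Refiner stage:", refiner_stage)
    print("CLIP fine-tuning stage:", clip_finetune_stage)
```

Laying the values out this way makes the asymmetry between the stages easy to check at a glance: the second stage uses a 5x smaller learning rate and a 4x smaller batch than the first.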