Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Authors: Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our method is experimentally validated on the open-vocabulary dense prediction tasks, including object detection and image segmentation. With only a few epochs of fine-tuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements when integrated with the recent RLA methods like CLIPSelf (Wu et al., 2023b) and RegionCLIP (Zhong et al., 2022) for object detection tasks. For the segmentation benchmarks, our method also improves the performance of the recent state-of-the-art model Cat-Seg (Cho et al., 2023). (Section 4, Experimental Results) |
| Researcher Affiliation | Academia | 1School of Software Engineering, Xi'an Jiaotong University, China 2School of Computer and Communication Sciences, EPFL, Switzerland 3University of Chinese Academy of Sciences, Beijing, China |
| Pseudocode | No | The paper describes methods through textual descriptions and mathematical equations (e.g., equations 1-15) and visual diagrams (e.g., Figure 1, 2, 3, 4, 11), but does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code will be available at https://congpeiqiu.github.io/Refining |
| Open Datasets | Yes | With only a few epochs of finetuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements... Both stage are trained on COCO train2017 dataset (Lin et al., 2014)... using COCO Panoptic dataset masks (Kirillov et al., 2019)... for unsupervised segmentation on Cityscapes (Cordts et al., 2016)... Trained on the ADE20K dataset (Zhou et al., 2017) and evaluated on ADE-847, ADE-150, and Pascal Context (Mottaghi et al., 2014)... COCO-Stuff (Caesar et al., 2018)... on CC3M (Sharma et al., 2018) |
| Dataset Splits | Yes | Both stages are trained on the COCO train2017 dataset (Lin et al., 2014)... We adopt F-ViT (Wu et al., 2023b) as the open-vocabulary object detector... The F-ViT model is trained for 3 epochs for the OV-COCO benchmark and 48 epochs for the OV-LVIS benchmark... By calculating the average CR value across COCO val2017 |
| Hardware Specification | Yes | Concretely, we use 8 RTX 3090 GPUs for both stages with the AdamW (Loshchilov & Hutter, 2017) optimizer. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and the MaskCLIP protocol, but does not provide specific version numbers for any software libraries or packages. |
| Experiment Setup | Yes | For the first stage, we set the learning rate to 1e-4 and train Refiner for 4 epochs with a batch size of 16. For the second stage, we set the learning rate to 2e-5 and perform CLIP fine-tuning for 6 epochs with a batch size of 4... To optimize Refiner, we generate C = 4 crops per image at scale ratios between [0.3, 0.7]. During the stage of spatial correlation distillation, we set the temperature τT = τS = 0.2, with λ = 0.2 for ViT-B and λ = 0.4 for ViT-L. |
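
The quoted two-stage setup can be collected into a single configuration sketch. This is a minimal illustration assembled from the hyperparameters reported above; all names (`REFINING_CLIP_CONFIG`, `lambda_for`) are hypothetical and not taken from the authors' code:

```python
# Hypothetical config sketch for the two-stage setup quoted in the table.
# Values come from the reported experiment setup; key names are illustrative.

REFINING_CLIP_CONFIG = {
    "stage1_refiner": {
        "optimizer": "AdamW",
        "learning_rate": 1e-4,
        "epochs": 4,
        "batch_size": 16,
        "crops_per_image": 4,          # C = 4
        "crop_scale_range": (0.3, 0.7),
    },
    "stage2_clip_finetune": {
        "optimizer": "AdamW",
        "learning_rate": 2e-5,
        "epochs": 6,
        "batch_size": 4,
        "temperature_teacher": 0.2,    # tau_T
        "temperature_student": 0.2,    # tau_S
    },
    "hardware": "8x RTX 3090",
    "train_data": "COCO train2017",
}


def lambda_for(backbone: str) -> float:
    """Distillation weight lambda per backbone, as reported in the setup."""
    return {"ViT-B": 0.2, "ViT-L": 0.4}[backbone]
```

A helper like `lambda_for` makes the backbone-dependent loss weight explicit instead of burying it in a conditional inside the training loop.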