Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Authors: Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of our method is experimentally validated on open-vocabulary dense prediction tasks, including object detection and image segmentation. With only a few epochs of finetuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements when integrated with recent RLA methods such as CLIPSelf (Wu et al., 2023b) and RegionCLIP (Zhong et al., 2022) for object detection tasks. For the segmentation benchmarks, our method also improves the performance of the recent state-of-the-art model CAT-Seg (Cho et al., 2023).
Researcher Affiliation | Academia | (1) School of Software Engineering, Xi'an Jiaotong University, China; (2) School of Computer and Communication Sciences, EPFL, Switzerland; (3) University of Chinese Academy of Sciences, Beijing, China
Pseudocode | No | The paper describes its methods through textual descriptions, mathematical equations (e.g., Equations 1-15), and visual diagrams (e.g., Figures 1, 2, 3, 4, 11), but does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code will be available at https://congpeiqiu.github.io/Refining
Open Datasets | Yes | With only a few epochs of finetuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements... Both stages are trained on the COCO train2017 dataset (Lin et al., 2014)... using COCO Panoptic dataset masks (Kirillov et al., 2019)... for unsupervised segmentation on Cityscapes (Cordts et al., 2016)... Trained on the ADE20K dataset (Zhou et al., 2017) and evaluated on ADE-847, ADE-150, and Pascal Context (Mottaghi et al., 2014)... COCO-Stuff (Caesar et al., 2018)... on CC3M (Sharma et al., 2018)
Dataset Splits | Yes | Both stages are trained on the COCO train2017 dataset (Lin et al., 2014)... We adopt F-ViT (Wu et al., 2023b) as the open-vocabulary object detector... The F-ViT model is trained for 3 epochs for the OV-COCO benchmark and 48 epochs for the OV-LVIS benchmark... By calculating the average CR value across COCO val2017
Hardware Specification | Yes | Concretely, we use 8 RTX 3090 GPUs for both stages with the AdamW (Loshchilov & Hutter, 2017) optimizer.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the MaskCLIP protocol, but does not provide specific version numbers for any software libraries or packages.
Experiment Setup | Yes | For the first stage, we set the learning rate to 1e-4 and train the Refiner for 4 epochs with a batch size of 16. For the second stage, we set the learning rate to 2e-5 and perform CLIP fine-tuning for 6 epochs with a batch size of 4... To optimize the Refiner, we generate C = 4 crops per image at scale ratios between [0.3, 0.7]. During the spatial correlation distillation stage, we set the temperatures τT = τS = 0.2, with λ = 0.2 for ViT-B and λ = 0.4 for ViT-L.
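The quoted setup above can be collected into a single configuration sketch. This is a reconstruction from the reported hyperparameters only, assuming a two-stage pipeline (Refiner pre-training, then CLIP fine-tuning); the dictionary keys and structure are illustrative and not taken from the authors' code.

```python
# Hedged sketch: hyperparameters reconstructed from the quoted experiment setup.
# Key names are hypothetical; values come from the text above.
refiner_stage = {
    "learning_rate": 1e-4,
    "epochs": 4,
    "batch_size": 16,
    "crops_per_image": 4,           # C = 4 crops per image
    "crop_scale_range": (0.3, 0.7), # scale ratios for generated crops
}

clip_finetune_stage = {
    "learning_rate": 2e-5,
    "epochs": 6,
    "batch_size": 4,
    "temperature_teacher": 0.2,     # tau_T in spatial correlation distillation
    "temperature_student": 0.2,     # tau_S
    "lambda_by_backbone": {"ViT-B": 0.2, "ViT-L": 0.4},
}

if __name__ == "__main__":
    print("Refiner stage:", refiner_stage)
    print("CLIP fine-tuning stage:", clip_finetune_stage)
```

Laying the values out this way makes the asymmetry between the stages easy to check at a glance: the second stage uses a 5x smaller learning rate and a 4x smaller batch than the first.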