Diffusion Feedback Helps CLIP See Better

Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental — "We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, to a large extent (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. To evaluate the effectiveness of our DIVA and demonstrate its potential to enhance CLIP representations, comprehensive experiments are conducted on multimodal understanding and visual perception tasks, which will be elaborated in the following sections."
Researcher Affiliation: Academia — "1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Beijing Academy of Artificial Intelligence; 4 Institute of Information Science, Beijing Jiaotong University. Correspondence to EMAIL."
Pseudocode: Yes — "The pseudo-code of the specific enhancement process can be found at Algorithm 1 in Appendix B (PSEUDO CODE FOR DIVA PIPELINE, Algorithm 1: DIVA)."
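Algorithm 1 itself lives in the paper's appendix and is not reproduced in this report. As a rough illustration of what such a generative-feedback loop looks like, here is a minimal PyTorch sketch: a frozen diffusion model is conditioned on CLIP visual features, and the denoising loss is backpropagated into the CLIP encoder only. The module names (`CLIPVisual`, `FrozenDiffusion`, `diva_step`), the toy noise predictor, and all shapes are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CLIPVisual(nn.Module):
    """Stand-in for the trainable CLIP visual encoder."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 8, dim)

    def forward(self, images):
        return self.proj(images.flatten(1))


class FrozenDiffusion(nn.Module):
    """Stand-in for the frozen diffusion model. The real pipeline uses a
    pretrained text-to-image model (e.g., SD-2-1-base) whose noise
    prediction is conditioned on CLIP features; here a single linear layer
    plays that role so the sketch stays self-contained."""

    def __init__(self, dim=64):
        super().__init__()
        self.denoise = nn.Linear(dim, 3 * 8 * 8)

    def forward(self, noisy_images, t, cond_feat):
        # Toy noise predictor driven only by the conditioning features.
        return self.denoise(cond_feat).reshape_as(noisy_images)


def diva_step(clip_visual, diffusion, optimizer, images, n_rounds=2):
    """One optimization step: N rounds of diffusion sampling provide
    generative feedback; only the CLIP encoder receives updates."""
    for p in diffusion.parameters():
        p.requires_grad_(False)  # diffusion model stays frozen
    total = 0.0
    for _ in range(n_rounds):  # the paper reports N=2 as optimal
        noise = torch.randn_like(images)
        t = torch.randint(0, 1000, (images.shape[0],))
        noisy = images + noise  # schematic forward noising
        cond = clip_visual(images)  # condition diffusion on CLIP features
        pred = diffusion(noisy, t, cond)
        loss = nn.functional.mse_loss(pred, noise)  # denoising objective
        optimizer.zero_grad()
        loss.backward()  # gradients flow through the frozen model into CLIP
        optimizer.step()
        total += loss.item()
    return total / n_rounds
```

The key design point the sketch tries to capture is that the diffusion model acts purely as a fixed critic: its parameters never update, so the denoising loss can only improve by making the CLIP features more visually informative.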
Open Source Code: Yes — "The code is publicly available at https://github.com/baaivision/DIVA."
Open Datasets: Yes — "We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training... We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark... Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The details about all benchmarks can be found at Table 9 in Appendix."
Dataset Splits: No — "We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training, which can already boost CLIP's performance in a training-efficient manner."
Hardware Specification: Yes — "DIVA is trained on 8 NVIDIA A100 80GB GPUs with a batch size of 640."
Software Dependencies: No — "The paper discusses the use of specific diffusion models (e.g., SD-2-1-base) but does not provide version numbers for any software libraries or frameworks used (e.g., PyTorch, TensorFlow, CUDA versions)."
Experiment Setup: Yes — "DIVA is trained on 8 NVIDIA A100 80GB GPUs with a batch size of 640. We adopt the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1e-4 and momentum of 0.9 to refine CLIP's representations via generative feedback. We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training... When increasing N from 1 to 2 (meaning that each image undergoes diffusion sampling twice to provide two rounds of generative feedback for CLIP model optimization), performance gains are observed. Therefore, N=2 is selected as the optimal sampling step to consistently improve performance across various baselines."
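The "4600 steps (i.e., nearly 1 epoch)" claim can be sanity-checked from the stated batch size, taking the commonly cited ~3M image-text pairs for Conceptual-3M as an assumed dataset size:

```python
# Back-of-the-envelope check that 4600 steps at batch size 640 is
# "nearly 1 epoch" of Conceptual-3M (assumed ~3M image-text pairs).
batch_size = 640
steps = 4600
cc3m_size = 3_000_000  # approximate size of Conceptual-3M

samples_seen = batch_size * steps
epochs = samples_seen / cc3m_size
print(samples_seen, round(epochs, 2))  # 2944000 0.98
```

At roughly 0.98 epochs, the reported step count is consistent with the paper's "nearly 1 epoch" description.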