Diffusion Feedback Helps CLIP See Better

Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental — "We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, to a large extent (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. To evaluate the effectiveness of our DIVA and demonstrate its potential to enhance CLIP representations, comprehensive experiments are conducted on multimodal understanding and visual perception tasks, which will be elaborated in the following sections."
Researcher Affiliation: Academia — "1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Beijing Academy of Artificial Intelligence; 4 Institute of Information Science, Beijing Jiaotong University. Correspondence to EMAIL."
Pseudocode: Yes — "The pseudo-code of the specific enhancement process can be found at Algorithm 1 in Appendix B (PSEUDO CODE FOR DIVA PIPELINE, Algorithm 1: DIVA)."
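Algorithm 1 itself lives in the paper's appendix and is not reproduced in this report. As a rough illustration of what such a generative-feedback loop looks like, here is a minimal PyTorch sketch: a frozen diffusion model is conditioned on CLIP visual features, and the denoising loss is backpropagated into the CLIP encoder only. The module names (`CLIPVisual`, `FrozenDiffusion`, `diva_step`), the toy noise predictor, and all shapes are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CLIPVisual(nn.Module):
    """Stand-in for the trainable CLIP visual encoder."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 8, dim)

    def forward(self, images):
        return self.proj(images.flatten(1))


class FrozenDiffusion(nn.Module):
    """Stand-in for the frozen diffusion model. The real pipeline uses a
    pretrained text-to-image model (e.g., SD-2-1-base) whose noise
    prediction is conditioned on CLIP features; here a single linear layer
    plays that role so the sketch stays self-contained."""

    def __init__(self, dim=64):
        super().__init__()
        self.denoise = nn.Linear(dim, 3 * 8 * 8)

    def forward(self, noisy_images, t, cond_feat):
        # Toy noise predictor driven only by the conditioning features.
        return self.denoise(cond_feat).reshape_as(noisy_images)


def diva_step(clip_visual, diffusion, optimizer, images, n_rounds=2):
    """One optimization step: N rounds of diffusion sampling provide
    generative feedback; only the CLIP encoder receives updates."""
    for p in diffusion.parameters():
        p.requires_grad_(False)  # diffusion model stays frozen
    total = 0.0
    for _ in range(n_rounds):  # the paper reports N=2 as optimal
        noise = torch.randn_like(images)
        t = torch.randint(0, 1000, (images.shape[0],))
        noisy = images + noise  # schematic forward noising
        cond = clip_visual(images)  # condition diffusion on CLIP features
        pred = diffusion(noisy, t, cond)
        loss = nn.functional.mse_loss(pred, noise)  # denoising objective
        optimizer.zero_grad()
        loss.backward()  # gradients flow through the frozen model into CLIP
        optimizer.step()
        total += loss.item()
    return total / n_rounds
```

The key design point the sketch tries to capture is that the diffusion model acts purely as a fixed critic: its parameters never update, so the denoising loss can only improve by making the CLIP features more visually informative.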
Open Source Code: Yes — "The code is publicly available at https://github.com/baaivision/DIVA."
Open Datasets: Yes — "We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training... We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark... Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The details about all benchmarks can be found at Table 9 in Appendix."
Dataset Splits: No — "We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training, which can already boost CLIP's performance in a training-efficient manner."
Hardware Specification: Yes — "DIVA is trained on 8 NVIDIA A100 80GB GPUs with a batch size of 640."
Software Dependencies: No — "The paper discusses the use of specific diffusion models (e.g., SD-2-1-base) but does not provide version numbers for any software libraries or frameworks used (e.g., PyTorch, TensorFlow, CUDA versions)."
Experiment Setup: Yes — "DIVA is trained on 8 NVIDIA A100 80GB GPUs with a batch size of 640. We adopt the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1e-4 and momentum of 0.9 to refine CLIP's representations via generative feedback. We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training... When increasing N from 1 to 2 (meaning that each image undergoes diffusion sampling twice to provide two rounds of generative feedback for CLIP model optimization), performance gains are observed. Therefore, N=2 is selected as the optimal sampling step to consistently improve performance across various baselines."
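The "4600 steps (i.e., nearly 1 epoch)" claim can be sanity-checked from the stated batch size, taking the commonly cited ~3M image-text pairs for Conceptual-3M as an assumed dataset size:

```python
# Back-of-the-envelope check that 4600 steps at batch size 640 is
# "nearly 1 epoch" of Conceptual-3M (assumed ~3M image-text pairs).
batch_size = 640
steps = 4600
cc3m_size = 3_000_000  # approximate size of Conceptual-3M

samples_seen = batch_size * steps
epochs = samples_seen / cc3m_size
print(samples_seen, round(epochs, 2))  # 2944000 0.98
```

At roughly 0.98 epochs, the reported step count is consistent with the paper's "nearly 1 epoch" description.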