Diffusion Feedback Helps CLIP See Better
Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, to a large extent (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. To evaluate the effectiveness of our DIVA and demonstrate its potential to enhance CLIP representations, comprehensive experiments are conducted on multimodal understanding and visual perception tasks, which are elaborated in the following. |
| Researcher Affiliation | Academia | 1 Institute of Automation, Chinese Academy of Sciences 2 School of Artificial Intelligence, University of Chinese Academy of Sciences 3 Beijing Academy of Artificial Intelligence 4 Institute of Information Science, Beijing Jiaotong University Correspondence to EMAIL. |
| Pseudocode | Yes | The pseudo code of the specific enhancement process can be found at Algorithm 1 in Appendix B ("Pseudo Code for DIVA Pipeline", Algorithm 1: DIVA). |
| Open Source Code | Yes | The code is publicly available at https://github.com/baaivision/DIVA. |
| Open Datasets | Yes | We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training... We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark... Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The details about all benchmarks can be found at Table 9 in Appendix. |
| Dataset Splits | No | We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training, which can already boost CLIP's performance in a training-efficient manner. |
| Hardware Specification | Yes | DIVA is trained on 8 NVIDIA-A100 80GB GPUs with a batch size of 640. |
| Software Dependencies | No | The paper discusses the use of specific diffusion models (e.g., SD-2-1-base) but does not provide version numbers for any software libraries or frameworks used (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | DIVA is trained on 8 NVIDIA-A100 80GB GPUs with a batch size of 640. We adopt the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1e-4 and momentum of 0.9 to refine CLIP's representations via generative feedback. We only optimize the CLIP models with the relatively high-quality Conceptual-3M dataset (Sharma et al., 2018) for 4600 steps (i.e., nearly 1 epoch) during training... When increasing N from 1 to 2 (meaning that each image undergoes diffusion sampling twice to provide two rounds of generative feedback for CLIP model optimization), performance gains are observed. Therefore, N=2 is selected as the optimal sampling step to consistently improve performance across various baselines. |
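The experiment-setup row above fully specifies the optimizer. As a minimal sketch, the reported settings can be collected and exercised in plain Python; note this is an illustrative stand-in (the function `sgd_momentum_step` and the toy objective are hypothetical, not the authors' implementation, which per the table does not ship library version numbers):

```python
# Hyperparameters as reported in the paper's experiment setup.
LEARNING_RATE = 1e-4     # SGD learning rate
MOMENTUM = 0.9           # SGD momentum
BATCH_SIZE = 640         # total batch across 8 NVIDIA-A100 80GB GPUs
TRAIN_STEPS = 4600       # ~1 epoch over Conceptual-3M
N_SAMPLING_ROUNDS = 2    # diffusion-sampling feedback rounds per image (N=2)

def sgd_momentum_step(param, grad, velocity,
                      lr=LEARNING_RATE, momentum=MOMENTUM):
    """One SGD-with-momentum update using the reported settings.

    velocity accumulates past gradients scaled by the momentum factor;
    the parameter moves against the accumulated direction.
    """
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

# Toy usage: minimize f(w) = w**2 (gradient 2w) for a few steps,
# standing in for refining a single CLIP representation weight.
w, v = 1.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, grad=2 * w, velocity=v)
```

With these small learning-rate/momentum values the toy weight decays slowly toward zero, which matches the paper's "training-efficient" framing of a short, gentle refinement rather than retraining CLIP from scratch.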