EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

Authors: GuangHao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang, Rui Zhang, Yong Jiang

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks. |
| Researcher Affiliation | Collaboration | 1. Tsinghua Shenzhen International Graduate School, Tsinghua University; 2. Peng Cheng Laboratory; 3. School of Computer Science & Technology, Huazhong University of Science and Technology; 4. College of Computer Science and Software Engineering, Shenzhen University; 5. Huawei Noah's Ark Lab; 6. Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes its methods textually and includes an overall architecture diagram (Figure 3), but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Benchmark Datasets: (1) Flickr30K (Plummer et al. 2015) contains 31,000 images, each annotated with 5 captions. ... (2) MSCOCO (Lin et al. 2014) comprises 123,287 images, each annotated with 5 captions. ... (3) MSR-VTT (Xu et al. 2016) includes 10K videos with 200K captions. ... (4) SBU30K (Ordonez, Kulkarni, and Berg 2011) consists of 36K image-text pairs, randomly sampled from SBU Captions and split into 30K/3K/3K for training, validation, and testing. Similarly, we obtain (6) CC30K and (7) YFCC30K by randomly sampling from CC12M and YFCC15M. |
| Dataset Splits | Yes | Benchmark Datasets: (1) Flickr30K ... split into 29K/1K/1K images for training, validation, and testing. (2) MSCOCO ... split into 114K/5K/5K for training, validation, and testing. (3) MSR-VTT ... We employ 9K videos for training and evaluate on the 1K test set. (4) SBU30K ... split into 30K/3K/3K for training, validation, and testing. |
| Hardware Specification | No | The paper mentions vision encoder models such as "ViT-B/32" but does not specify the hardware (e.g., GPU models, CPU models, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using a "pre-trained T5-large model" and the "Adam optimizer (Kingma and Ba 2014)", but it does not provide version numbers for the software libraries, programming languages, or other dependencies needed to reproduce the experiments. |
| Experiment Setup | Yes | We conduct the warm-up phase of EaRW with a learning rate of 3e-5, a batch size of 8, and over 20 epochs. For the Rank Preference Optimisation (RPO) model, we set the learning rate to 5e-7, with a batch size of 16, across 5 epochs, and use a rank length of 5. The weight of the SFT loss β is set to 0.2, and the probability of random rewriting during CLIP fine-tuning p is set to 0.6. For fine-tuning CLIP, we employ the Adam optimizer (Kingma and Ba 2014) with a weight decay of 1e-3 and a batch size of 256. The total number of fine-tuning epochs is set to 20. The initial learning rate is set to 1e-6 with a cosine learning-rate decay scheduler applied. We apply a warm-up strategy for the initial 2k steps. |
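The CLIP fine-tuning schedule quoted in the Experiment Setup row (initial learning rate 1e-6, cosine decay, 2k warm-up steps) can be sketched as follows. This is a minimal illustration, not the authors' code: the linear warm-up shape, the function name, and the `total_steps` argument are assumptions; the paper states only the hyperparameter values.

```python
import math

BASE_LR = 1e-6       # initial learning rate (from the paper)
WARMUP_STEPS = 2000  # warm-up steps (from the paper)


def lr_at_step(step: int, total_steps: int) -> float:
    """Linear warm-up to BASE_LR, then cosine decay toward zero.

    `total_steps` (assumed here) would be epochs * batches-per-epoch.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, the rate rises linearly to 1e-6 at step 2000 and then decays along a half-cosine for the remaining steps.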
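The Open Datasets row describes constructing SBU30K (and analogously CC30K and YFCC30K) by random sampling followed by a 30K/3K/3K split. A sketch of that procedure, under assumptions of my own (the function name, the fixed seed, and treating the split sizes as defaults are not from the paper):

```python
import random


def sample_and_split(pairs, n_train=30_000, n_val=3_000, n_test=3_000, seed=0):
    """Randomly draw n_train + n_val + n_test items and partition them.

    Mirrors the described 30K/3K/3K sampling from a larger caption corpus
    (SBU Captions, CC12M, or YFCC15M); the seed is illustrative.
    """
    rng = random.Random(seed)
    picked = rng.sample(list(pairs), n_train + n_val + n_test)
    train = picked[:n_train]
    val = picked[n_train:n_train + n_val]
    test = picked[n_train + n_val:]
    return train, val, test
```

Sampling before splitting keeps the three subsets disjoint by construction, which matches the paper's description of a single random draw from each source corpus.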