Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval
Authors: Guofeng Ding, Yiding Lu, Peng Hu, Mouxing Yang, Yijie Lin, Xi Peng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems. |
| Researcher Affiliation | Academia | 1School of Computer Science, Sichuan University, Chengdu, China 2National Key Laboratory of Fundamental Algorithms and Models for Engineering Numerical Simulation, Sichuan University, China. Correspondence to: Yijie Lin <EMAIL>, Xi Peng <EMAIL>. |
| Pseudocode | No | The paper describes the method using natural language and diagrams (e.g., Fig. 2), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/XLearning-SCU/2025-ICML-VISA |
| Open Datasets | Yes | We evaluate our approach on the MS-COCO (Chen et al., 2015) and Flickr30K (Plummer et al., 2015) datasets. For the video retrieval task, we evaluate our approach on four widely-used datasets: MSR-VTT (Xu et al., 2016), DiDeMo (Hendricks et al., 2017), LSMDC (Rohrbach et al., 2015), and MSVD (Chen & Dolan, 2011). To evaluate our approach, we utilize four datasets: DCI (Urbanek et al., 2024), IIW (Garg et al., 2024), Urban-1k (Zhang et al., 2024a), and ShareGPT4V (Chen et al., 2023a). |
| Dataset Splits | Yes | MS-COCO includes 5,000 test images, each annotated with five manually annotated captions describing a diverse range of objects. Flickr30K emphasizes real-world scenarios with complex interactions, providing five detailed annotations for each of its 1,000 test images, capturing object relationships and actions. The test set of MSR-VTT contains 1,000 videos. The test set of DiDeMo contains 1,003 videos... The test set of LSMDC contains 1,000 clips... The test set of MSVD contains 670 videos... |
| Hardware Specification | Yes | All experiments are conducted on Ubuntu 20.04 with NVIDIA 4090 GPUs. |
| Software Dependencies | Yes | Unless stated otherwise, we utilize LMMs LLaVA-v1.6-34B (Liu et al., 2024a) and LLaVA-Video-32B (Zhang et al., 2024b) to obtain the general descriptions of images and videos, respectively. For the question-answering process, we employ LLM Qwen2.5-32B (Team, 2024) as the question generator, LMM Qwen2-VL-7B (Wang et al., 2024a) as the answer generator, and gemma2 (Chen et al., 2024a) as the text retriever. |
| Experiment Setup | Yes | For video data, the frames per second (FPS) input to LMMs is set to 3. The size of the reranking gallery (k) is set to 20 and the number of questions is 3 for all datasets. For all evaluated models listed in Appendix E, we use the default hyper-parameters provided on Hugging Face. |
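The setup above (reranking gallery k = 20, 3 generated questions per query) implies a plug-and-play rerank loop over a base retriever's top candidates. Below is a minimal, hypothetical sketch of that flow; the function names (`generate_questions`, `answer`, `score_text`) are stand-ins for the LLM/LMM/text-retriever calls, not the authors' actual code.

```python
K = 20           # size of the reranking gallery (from the setup above)
N_QUESTIONS = 3  # questions generated per query (from the setup above)

def rerank(query, gallery_scores, generate_questions, answer, score_text):
    """Rerank a base retriever's top-K candidates for a text query.

    gallery_scores:               dict mapping candidate id -> base score.
    generate_questions(query):    list of questions (stand-in for the LLM).
    answer(candidate, question):  textual answer (stand-in for the LMM).
    score_text(query, text):      relevance score (stand-in for the text retriever).
    """
    # Take the top-K candidates from the base retriever.
    top_k = sorted(gallery_scores, key=gallery_scores.get, reverse=True)[:K]
    questions = generate_questions(query)[:N_QUESTIONS]
    scored = []
    for cand in top_k:
        # Aggregate the question-answer evidence into one text description,
        # then score it against the query with the text retriever.
        answers = " ".join(answer(cand, q) for q in questions)
        scored.append((cand, score_text(query, answers)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored]
```

Because the reranker only consumes candidate ids and scores, it can wrap any existing text-to-image or text-to-video retrieval system without retraining, which is the "plug-and-play" property claimed in the summary.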