Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval
Authors: Guofeng Ding, Yiding Lu, Peng Hu, Mouxing Yang, Yijie Lin, Xi Peng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems. |
| Researcher Affiliation | Academia | 1School of Computer Science, Sichuan University, Chengdu, China 2National Key Laboratory of Fundamental Algorithms and Models for Engineering Numerical Simulation, Sichuan University, China. Correspondence to: Yijie Lin <EMAIL>, Xi Peng <EMAIL>. |
| Pseudocode | No | The paper describes the method using natural language and diagrams (e.g., Fig. 2), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/XLearning-SCU/2025-ICML-VISA |
| Open Datasets | Yes | We evaluate our approach on the MS-COCO (Chen et al., 2015) and Flickr30K (Plummer et al., 2015) datasets. For the video retrieval task, we evaluate our approach on four widely-used datasets: MSR-VTT (Xu et al., 2016), DiDeMo (Hendricks et al., 2017), LSMDC (Rohrbach et al., 2015), and MSVD (Chen & Dolan, 2011). To evaluate our approach, we utilize four datasets: DCI (Urbanek et al., 2024), IIW (Garg et al., 2024), Urban-1k (Zhang et al., 2024a), and ShareGPT4V (Chen et al., 2023a). |
| Dataset Splits | Yes | MS-COCO includes 5,000 test images, each annotated with five manually annotated captions describing a diverse range of objects. Flickr30K emphasizes real-world scenarios with complex interactions, providing five detailed annotations for each of its 1,000 test images, capturing object relationships and actions. The test set of MSR-VTT contains 1,000 videos. The test set of DiDeMo contains 1,003 videos... The test set of LSMDC contains 1,000 clips... The test set of MSVD contains 670 videos... |
| Hardware Specification | Yes | All experiments are conducted on Ubuntu 20.04 with NVIDIA 4090 GPUs. |
| Software Dependencies | Yes | Unless stated otherwise, we utilize LMMs LLaVA-v1.6-34B (Liu et al., 2024a) and LLaVA-Video-32B (Zhang et al., 2024b) to obtain the general descriptions of images and videos, respectively. For the question-answering process, we employ LLM Qwen2.5-32B (Team, 2024) as the question generator, LMM Qwen2-VL-7B (Wang et al., 2024a) as the answer generator, and gemma2 (Chen et al., 2024a) as the text retriever. |
| Experiment Setup | Yes | For video data, the frames per second (FPS) input to LMMs is set to 3. The size of the reranking gallery (k) is set to 20 and the number of questions is 3 for all datasets. For all evaluated models listed in Appendix E, we use the default hyper-parameters provided on Hugging Face. |
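The setup above (reranking gallery k = 20, 3 generated questions per query) implies a plug-and-play rerank loop over a base retriever's top candidates. Below is a minimal, hypothetical sketch of that flow; the function names (`generate_questions`, `answer`, `score_text`) are stand-ins for the LLM/LMM/text-retriever calls, not the authors' actual code.

```python
K = 20           # size of the reranking gallery (from the setup above)
N_QUESTIONS = 3  # questions generated per query (from the setup above)

def rerank(query, gallery_scores, generate_questions, answer, score_text):
    """Rerank a base retriever's top-K candidates for a text query.

    gallery_scores:               dict mapping candidate id -> base score.
    generate_questions(query):    list of questions (stand-in for the LLM).
    answer(candidate, question):  textual answer (stand-in for the LMM).
    score_text(query, text):      relevance score (stand-in for the text retriever).
    """
    # Take the top-K candidates from the base retriever.
    top_k = sorted(gallery_scores, key=gallery_scores.get, reverse=True)[:K]
    questions = generate_questions(query)[:N_QUESTIONS]
    scored = []
    for cand in top_k:
        # Aggregate the question-answer evidence into one text description,
        # then score it against the query with the text retriever.
        answers = " ".join(answer(cand, q) for q in questions)
        scored.append((cand, score_text(query, answers)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored]
```

Because the reranker only consumes candidate ids and scores, it can wrap any existing text-to-image or text-to-video retrieval system without retraining, which is the "plug-and-play" property claimed in the summary.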