VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
Authors: Chun-Mei Feng, Yang Bai, Tao Luo, Zhen Li, Salman Khan, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets. ... Extensive experiments are conducted on the CIRR and Fashion-IQ datasets. The results show that our VQA4CIR can be incorporated with different CIR methods and outperforms the state-of-the-art CIR methods. ... Experimental results show that our VQA4CIR outperforms the state-of-the-art CIR methods and can be directly plugged into existing CIR methods. |
| Researcher Affiliation | Academia | 1Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 2SSE, The Chinese University of Hong Kong, Shenzhen (CUHK), China 3Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE 4Australian National University, Canberra ACT, Australia 5Harbin Institute of Technology, Harbin, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and illustrative figures (Figure 2, 3, 4) but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Code: https://github.com/chunmeifeng/VQA4CIR |
| Open Datasets | Yes | Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets. ... Extensive experiments are conducted on the CIRR and Fashion-IQ datasets. The detailed setups follow previous works (Suhr et al. 2018; Wu et al. 2021). |
| Dataset Splits | Yes | During training, we randomly adopt 5,000 and 3,000 samples from the CIRR dataset and Fashion-IQ training data, respectively, to fine-tune LLaMA (Touvron et al. 2023) and LLaVA (Liu et al. 2023a). ... We evaluate our method on two CIR benchmarks, i.e., CIRR (Suhr et al. 2018) and Fashion-IQ (Wu et al. 2021). The detailed setups follow previous works (Suhr et al. 2018; Wu et al. 2021). |
| Hardware Specification | Yes | Our VQA4CIR is implemented with PyTorch on NVIDIA RTX A100 GPUs with 40GB of memory per card. |
| Software Dependencies | Yes | Our VQA4CIR is implemented with PyTorch on NVIDIA RTX A100 GPUs with 40GB of memory per card. To preserve the generalization ability of the pre-trained models, i.e., LLaMA (Touvron et al. 2023) and LLaVA (Liu et al. 2023a), we leverage LoRA (Hu et al. 2021) to fine-tune them while keeping the backbones frozen, i.e., LLaVA-v1.5-13B and Vicuna-13B-v1.5. |
| Experiment Setup | Yes | The AdamW (Loshchilov and Hutter 2017) is adopted as the optimizer with a weight decay of 0.05 across all the experiments. We adopt Warmup Decay LR as the learning rate scheduler with warmup iterations of 1,000. For LLaVA (Liu et al. 2023a), the learning rate is initialized at 2e-5, while for LLaMA (Touvron et al. 2023), it is initialized at 3e-4. The hyperparameter of α is respectively set to 20 and 30 on the CIRR and Fashion-IQ datasets, while β is empirically set to 10 and 12. |
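The optimizer and scheduler details extracted above (AdamW, weight decay 0.05, 1,000 warmup iterations, learning rates of 2e-5 for LLaVA and 3e-4 for LLaMA) can be reproduced with standard PyTorch components. The sketch below is a minimal, hedged reconstruction: the total iteration count and the exact decay shape after warmup are not stated in the paper, so `TOTAL_ITERS` and the linear-decay branch are assumptions, and the `nn.Linear` model is a stand-in for the actual LoRA-wrapped backbone.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model; the paper fine-tunes LLaVA-v1.5-13B / Vicuna-13B-v1.5 via LoRA.
model = torch.nn.Linear(16, 16)

# AdamW with weight decay 0.05, as reported for all experiments.
# lr = 2e-5 is the reported LLaVA setting; use 3e-4 for LLaMA.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)

WARMUP_ITERS = 1_000   # warmup iterations reported in the paper
TOTAL_ITERS = 10_000   # assumed: total schedule length is not reported

def warmup_decay(step: int) -> float:
    """Linear warmup then linear decay -- one plausible 'Warmup Decay LR'."""
    if step < WARMUP_ITERS:
        return step / WARMUP_ITERS
    return max(0.0, (TOTAL_ITERS - step) / (TOTAL_ITERS - WARMUP_ITERS))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_decay)
```

In a training loop, `scheduler.step()` is called once per iteration after `optimizer.step()`, so the learning rate ramps from 0 to its base value over the first 1,000 iterations and then decays.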