VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
Authors: Chun-Mei Feng, Yang Bai, Tao Luo, Zhen Li, Salman Khan, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets. ... Extensive experiments are conducted on the CIRR and Fashion-IQ datasets. The results show that our VQA4CIR can be incorporated with different CIR methods and outperforms the state-of-the-art CIR methods. ... Experimental results show that our VQA4CIR outperforms the state-of-the-art CIR methods and can be directly plugged into existing CIR methods. |
| Researcher Affiliation | Academia | 1Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 2SSE, The Chinese University of Hong Kong, Shenzhen (CUHK), China 3Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE 4Australian National University, Canberra ACT, Australia 5Harbin Institute of Technology, Harbin, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and illustrative figures (Figure 2, 3, 4) but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Code: https://github.com/chunmeifeng/VQA4CIR |
| Open Datasets | Yes | Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets. ... Extensive experiments are conducted on the CIRR and Fashion-IQ datasets. The detailed setups follow previous works (Suhr et al. 2018; Wu et al. 2021). |
| Dataset Splits | Yes | During training, we randomly adopt 5,000 and 3,000 samples from the CIRR dataset and Fashion-IQ training data, respectively, to fine-tune LLaMA (Touvron et al. 2023) and LLaVA (Liu et al. 2023a). ... We evaluate our method on two CIR benchmarks, i.e., CIRR (Suhr et al. 2018) and Fashion-IQ (Wu et al. 2021). The detailed setups follow previous works (Suhr et al. 2018; Wu et al. 2021). |
| Hardware Specification | Yes | Our VQA4CIR is implemented with PyTorch on NVIDIA RTX A100 GPUs with 40GB of memory per card. |
| Software Dependencies | Yes | Our VQA4CIR is implemented with PyTorch on NVIDIA RTX A100 GPUs with 40GB of memory per card. To preserve the generalization ability of the pre-trained models, i.e., LLaMA (Touvron et al. 2023) and LLaVA (Liu et al. 2023a), we leverage LoRA (Hu et al. 2021) to fine-tune them while keeping the backbones frozen, i.e., LLaVA-v1.5-13B and Vicuna-13B-v1.5. |
| Experiment Setup | Yes | The AdamW (Loshchilov and Hutter 2017) is adopted as the optimizer with a weight decay of 0.05 across all the experiments. We adopt Warmup Decay LR as the learning rate scheduler with warmup iterations of 1,000. For LLaVA (Liu et al. 2023a), the learning rate is initialized at 2e-5, while for LLaMA (Touvron et al. 2023), it is initialized at 3e-4. The hyperparameter of α is respectively set to 20 and 30 on the CIRR and Fashion-IQ datasets, while β is empirically set to 10 and 12. |
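The optimizer and scheduler details extracted above (AdamW, weight decay 0.05, 1,000 warmup iterations, learning rates of 2e-5 for LLaVA and 3e-4 for LLaMA) can be reproduced with standard PyTorch components. The sketch below is a minimal, hedged reconstruction: the total iteration count and the exact decay shape after warmup are not stated in the paper, so `TOTAL_ITERS` and the linear-decay branch are assumptions, and the `nn.Linear` model is a stand-in for the actual LoRA-wrapped backbone.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model; the paper fine-tunes LLaVA-v1.5-13B / Vicuna-13B-v1.5 via LoRA.
model = torch.nn.Linear(16, 16)

# AdamW with weight decay 0.05, as reported for all experiments.
# lr = 2e-5 is the reported LLaVA setting; use 3e-4 for LLaMA.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)

WARMUP_ITERS = 1_000   # warmup iterations reported in the paper
TOTAL_ITERS = 10_000   # assumed: total schedule length is not reported

def warmup_decay(step: int) -> float:
    """Linear warmup then linear decay -- one plausible 'Warmup Decay LR'."""
    if step < WARMUP_ITERS:
        return step / WARMUP_ITERS
    return max(0.0, (TOTAL_ITERS - step) / (TOTAL_ITERS - WARMUP_ITERS))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_decay)
```

In a training loop, `scheduler.step()` is called once per iteration after `optimizer.step()`, so the learning rate ramps from 0 to its base value over the first 1,000 iterations and then decays.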