Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines

Authors: Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.
Researcher Affiliation Academia 1Department of Electronic Engineering, Tsinghua University; 2Shanghai Artificial Intelligence Laboratory
Pseudocode No The paper describes the methodology in narrative text and figures (e.g., Figure 2: The architecture of ReAuSE) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The code will be available at https://github.com/xinwei666/ReAuSE
Open Datasets Yes We focus on the knowledge-based VQA benchmarks, OKVQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022). Previous work provided two retrieval corpora, GS112K (Luo et al. 2021) and Wiki21M (Karpukhin et al. 2020), for the OKVQA dataset. Additionally, we introduce a new information-seeking dataset, InfoSeek (Chen et al. 2023d), to evaluate the model's retrieval performance.
Dataset Splits Yes We strictly follow the settings of the original papers, using the corresponding metrics for each dataset. For the OKVQA dataset and the direct answer setting of the A-OKVQA dataset, we use the VQA score to evaluate the model's performance. For the multi-choice setting of the A-OKVQA dataset, we use accuracy for evaluation.
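The VQA score mentioned above is the standard soft-accuracy metric from the VQA benchmarks, which credits a prediction by how many of the ten human annotators gave the same answer. A minimal sketch of the commonly used simplified form (the official evaluator additionally averages over annotator subsets and normalizes answer strings, which is omitted here):

```python
def vqa_score(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA soft accuracy: min(1, #matching annotators / 3),
    so agreeing with at least 3 of the 10 annotators earns full credit."""
    matches = sum(1 for ans in human_answers if ans == prediction)
    return min(1.0, matches / 3)

# 2 of 10 annotators said "umbrella" -> partial credit of 2/3
answers = ["umbrella", "umbrella"] + ["parasol"] * 8
print(vqa_score("umbrella", answers))  # 0.666...
print(vqa_score("parasol", answers))   # 1.0
```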
Hardware Specification Yes Each training stage is performed on four NVIDIA A6000 48G GPUs and completed within three hours.
Software Dependencies Yes Our model is implemented in PyTorch, utilizing version 0.3.0 of the PEFT library, which supports efficient switching between two LoRA adapters during inference.
Experiment Setup Yes In our main experiments, we utilize MiniGPT4v2-7B as the base model, which employs ViT-L/14 from pretrained CLIP as the image encoder and LLaMA-v2-7B (Touvron et al. 2023) as the text encoder. We freeze all parameters of the MLLM, allowing updates only to the LoRA parameters. We use the same MLLM in the three stages but apply two sets of LoRA parameters to optimize the model respectively: one for retrieval and alignment, and the other for answer generation.
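The setup above (a single frozen backbone with two named LoRA adapter sets that are swapped between the retrieval and answer-generation roles) can be illustrated with a toy stand-alone sketch. This is not the authors' code and does not use the PEFT library; the method names (`add_adapter`, `set_adapter`) merely echo that style, and the tiny weights are made up for illustration:

```python
class LoRALinear:
    """A frozen linear layer y = Wx plus a switchable low-rank update B(Ax)."""

    def __init__(self, weight):
        self.weight = weight   # frozen base weight W, as a list of rows
        self.adapters = {}     # adapter name -> (A, B) low-rank pair
        self.active = None

    def add_adapter(self, name, A, B):
        # A: r x d_in, B: d_out x r; only these would be trained
        self.adapters[name] = (A, B)

    def set_adapter(self, name):
        self.active = name

    def forward(self, x):
        # frozen base projection y = Wx
        y = [sum(w * v for w, v in zip(row, x)) for row in self.weight]
        if self.active is not None:
            A, B = self.adapters[self.active]
            h = [sum(a * v for a, v in zip(row, x)) for row in A]      # Ax
            delta = [sum(b * hv for b, hv in zip(row, h)) for row in B]  # B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        return y

layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]])  # 2x2 identity base, frozen
layer.add_adapter("retrieval", A=[[1.0, 0.0]], B=[[0.5], [0.0]])
layer.add_adapter("generation", A=[[0.0, 1.0]], B=[[0.0], [0.5]])

layer.set_adapter("retrieval")
print(layer.forward([2.0, 3.0]))  # [3.0, 3.0]
layer.set_adapter("generation")
print(layer.forward([2.0, 3.0]))  # [2.0, 4.5]
```

Switching adapters changes only which small (A, B) pair is applied on top of the shared frozen weights, which is why one MLLM can serve both the retrieval/alignment and answer-generation stages cheaply.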