Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines

Authors: Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.
Researcher Affiliation Academia 1Department of Electronic Engineering, Tsinghua University; 2Shanghai Artificial Intelligence Laboratory
Pseudocode No The paper describes the methodology in narrative text and figures (e.g., Figure 2: The architecture of ReAuSE) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The code will be available at https://github.com/xinwei666/ReAuSE
Open Datasets Yes We focus on the knowledge-based VQA benchmarks, OKVQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022). Previous work provided two retrieval corpora, GS112K (Luo et al. 2021) and Wiki21M (Karpukhin et al. 2020), for the OKVQA dataset. Additionally, we introduce a new information-seeking dataset, InfoSeek (Chen et al. 2023d), to evaluate the model's retrieval performance.
Dataset Splits Yes We strictly follow the settings of the original papers, using the corresponding metrics for each dataset. For the OKVQA dataset and the direct answer setting of the A-OKVQA dataset, we use the VQA score to evaluate the model's performance. For the multi-choice setting of the A-OKVQA dataset, we use accuracy for evaluation.
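The VQA score mentioned above is the standard soft-accuracy metric from the VQA benchmarks, which credits a prediction by how many of the ten human annotators gave the same answer. A minimal sketch of the commonly used simplified form (the official evaluator additionally averages over annotator subsets and normalizes answer strings, which is omitted here):

```python
def vqa_score(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA soft accuracy: min(1, #matching annotators / 3),
    so agreeing with at least 3 of the 10 annotators earns full credit."""
    matches = sum(1 for ans in human_answers if ans == prediction)
    return min(1.0, matches / 3)

# 2 of 10 annotators said "umbrella" -> partial credit of 2/3
answers = ["umbrella", "umbrella"] + ["parasol"] * 8
print(vqa_score("umbrella", answers))  # 0.666...
print(vqa_score("parasol", answers))   # 1.0
```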
Hardware Specification Yes Each training stage is performed on four NVIDIA A6000 48G GPUs and completed within three hours.
Software Dependencies Yes Our model is implemented in PyTorch, utilizing version 0.3.0 of the PEFT library, which supports efficient switching between two LoRA adapters during inference.
Experiment Setup Yes In our main experiments, we utilize MiniGPT4v2-7B as the base model, which employs ViT-L/14 from pretrained CLIP as the image encoder and LLaMA-v2-7B (Touvron et al. 2023) as the text encoder. We freeze all parameters of the MLLM, allowing updates only to the LoRA parameters. We use the same MLLM in the three stages but apply two sets of LoRA parameters to optimize the model respectively: one for retrieval and alignment, and the other for answer generation.
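The setup above (a single frozen backbone with two named LoRA adapter sets that are swapped between the retrieval and answer-generation roles) can be illustrated with a toy stand-alone sketch. This is not the authors' code and does not use the PEFT library; the method names (`add_adapter`, `set_adapter`) merely echo that style, and the tiny weights are made up for illustration:

```python
class LoRALinear:
    """A frozen linear layer y = Wx plus a switchable low-rank update B(Ax)."""

    def __init__(self, weight):
        self.weight = weight   # frozen base weight W, as a list of rows
        self.adapters = {}     # adapter name -> (A, B) low-rank pair
        self.active = None

    def add_adapter(self, name, A, B):
        # A: r x d_in, B: d_out x r; only these would be trained
        self.adapters[name] = (A, B)

    def set_adapter(self, name):
        self.active = name

    def forward(self, x):
        # frozen base projection y = Wx
        y = [sum(w * v for w, v in zip(row, x)) for row in self.weight]
        if self.active is not None:
            A, B = self.adapters[self.active]
            h = [sum(a * v for a, v in zip(row, x)) for row in A]      # Ax
            delta = [sum(b * hv for b, hv in zip(row, h)) for row in B]  # B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        return y

layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]])  # 2x2 identity base, frozen
layer.add_adapter("retrieval", A=[[1.0, 0.0]], B=[[0.5], [0.0]])
layer.add_adapter("generation", A=[[0.0, 1.0]], B=[[0.0], [0.5]])

layer.set_adapter("retrieval")
print(layer.forward([2.0, 3.0]))  # [3.0, 3.0]
layer.set_adapter("generation")
print(layer.forward([2.0, 3.0]))  # [2.0, 4.5]
```

Switching adapters changes only which small (A, B) pair is applied on top of the shared frozen weights, which is why one MLLM can serve both the retrieval/alignment and answer-generation stages cheaply.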