Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization

Authors: Yishu Liu, Jiawei Zhu, Congcong Wen, Guangming Lu, Hui Lin, Bingzhi Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various general and medical VQA datasets demonstrate the consistent superiority of our PDGH approach over existing state-of-the-art baselines.
Researcher Affiliation | Academia | 1Harbin Institute of Technology Shenzhen, Shenzhen, China; 2Beijing Institute of Technology, Zhuhai, China; 3China Academy of Electronics and Information Technology, Beijing, China
Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a direct link to a source-code repository for the methodology described.
Open Datasets | Yes | In our experiments, we select various out-of-distribution benchmarks to assess the robustness of models against real-world biases, such as VQA-CP v2, VQA-CP v1 (Agrawal et al. 2018), GQA-OOD (Kervadec et al. 2021), and VQA-CE (Dancette et al. 2021). Following VQA-CP (Agrawal et al. 2018), we develop a Semantically-Labeled Knowledge-Enhanced under Language Bias (SLAKE-LB) benchmark based on SLAKE (Liu et al. 2021) to verify the performance of our method in the medical domain.
Dataset Splits | Yes | Following VQA-CP (Agrawal et al. 2018), we develop a Semantically-Labeled Knowledge-Enhanced under Language Bias (SLAKE-LB) benchmark based on SLAKE (Liu et al. 2021) to verify the performance of our method in the medical domain. All experiments utilize the standard VQA evaluation metric (Antol et al. 2015). ... We re-split this dataset using the same partitioning ratio as VQA-CP v2 to maintain a consistent sample structure.
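The "standard VQA evaluation metric (Antol et al. 2015)" quoted above scores a predicted answer against the ten human-annotated answers, giving full credit when at least three annotators agree. The sketch below uses the commonly cited simplified form, min(#matches / 3, 1); the official evaluator additionally averages over leave-one-annotator-out subsets and normalizes answer strings, which is omitted here. Function name and toy answers are illustrative.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: full credit if >= 3 of the (typically 10)
    annotators gave the predicted answer, partial credit otherwise."""
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# 2 of 10 annotators agree -> 2/3 credit
print(vqa_accuracy("cat", ["cat", "cat", "dog"] + ["bird"] * 7))
```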
Hardware Specification | Yes | In our experiments, we implement the PDGH model on a single RTX 3090 GPU with PyTorch.
Software Dependencies | No | In our experiments, we implement the PDGH model on a single RTX 3090 GPU with PyTorch. ... we use KeyBERT (Grootendorst 2020) to extract keywords from the query, which can be formulated as follows: $\{K_i = (W^k_{i,1}, W^k_{i,2}, \dots, W^k_{i,L_k})\}_{i=1}^{N_k} = \mathrm{KeyBERT}(Q)$, (4) where $Q$ represents the query text, $K_i$ is the $i$-th key phrase, $L_k$ is the length of key phrases, and $N_k$ is the number of extracted key phrases. Next, we use the extracted keywords to guide the generation of image captions, which can be achieved by integrating the keywords into the input of MLLMs such as CogVLM2 (Wang et al. 2023)
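The quoted pipeline extracts keywords from the question and injects them into the caption prompt of an MLLM. A minimal sketch of that two-step flow is below; the frequency-based keyword ranking is a dependency-free stand-in for KeyBERT's embedding-based ranking, and the stopword list, function names, and prompt template are illustrative assumptions, not the paper's implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "is", "what", "in", "of", "on", "to", "this"}

def extract_keywords(query: str, top_n: int = 3) -> list[str]:
    """Frequency-based stand-in for KeyBERT: rank non-stopword tokens
    by count and keep the top_n (N_k in the paper's notation)."""
    words = [w for w in re.findall(r"[a-z]+", query.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

def caption_prompt(keywords: list[str]) -> str:
    """Fold keywords into the MLLM input to steer caption generation."""
    return f"Describe the image, focusing on: {', '.join(keywords)}."

kws = extract_keywords("What color is the umbrella next to the bench?")
print(caption_prompt(kws))
```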
Experiment Setup | Yes | The AdamW optimizer is used with a weight decay of 0.001, a learning rate of 0.001, and a batch size of 512.
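For reference, the quoted hyperparameters map onto AdamW's decoupled update rule (Loshchilov & Hutter), in which weight decay is applied directly to the parameter rather than folded into the gradient. A minimal pure-Python sketch of one step, using the paper's lr = 0.001 and weight_decay = 0.001 as defaults (beta/eps values are the common PyTorch defaults, an assumption here):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update: Adam moment estimates plus decoupled weight
    decay (lr * weight_decay * theta subtracted from the parameter)."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g          # first moment
        vi = beta2 * vi + (1 - beta2) * g * g      # second moment
        m_hat = mi / (1 - beta1 ** t)              # bias correction
        v_hat = vi / (1 - beta2 ** t)
        th = th - lr * (m_hat / (math.sqrt(v_hat) + eps)
                        + weight_decay * th)       # decoupled decay
        new_theta.append(th); new_m.append(mi); new_v.append(vi)
    return new_theta, new_m, new_v

theta, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], t=1)
print(theta)
```

In training code this corresponds to `torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.001)`.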