Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization

Authors: Yishu Liu, Jiawei Zhu, Congcong Wen, Guangming Lu, Hui Lin, Bingzhi Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various general and medical VQA datasets demonstrate the consistent superiority of our PDGH approach over existing state-of-the-art baselines.
Researcher Affiliation | Academia | 1Harbin Institute of Technology Shenzhen, Shenzhen, China; 2Beijing Institute of Technology, Zhuhai, China; 3China Academy of Electronics and Information Technology, Beijing, China
Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a direct link to a source-code repository for the methodology described.
Open Datasets | Yes | In our experiments, we select various out-of-distribution benchmarks to assess the robustness of models against real-world biases, such as VQA-CP v2, VQA-CP v1 (Agrawal et al. 2018), GQA-OOD (Kervadec et al. 2021), and VQA-CE (Dancette et al. 2021). Following VQA-CP (Agrawal et al. 2018), we develop a Semantically-Labeled Knowledge-Enhanced under Language Bias (SLAKE-LB) benchmark based on SLAKE (Liu et al. 2021) to verify the performance of our method in the medical domain.
Dataset Splits | Yes | Following VQA-CP (Agrawal et al. 2018), we develop a Semantically-Labeled Knowledge-Enhanced under Language Bias (SLAKE-LB) benchmark based on SLAKE (Liu et al. 2021) to verify the performance of our method in the medical domain. All experiments utilize the standard VQA evaluation metric (Antol et al. 2015). ... We re-split this dataset using the same partitioning ratio as VQA-CP v2 to maintain a consistent sample structure.
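The "standard VQA evaluation metric (Antol et al. 2015)" quoted above scores a predicted answer against the ten human-annotated answers, giving full credit when at least three annotators agree. The sketch below uses the commonly cited simplified form, min(#matches / 3, 1); the official evaluator additionally averages over leave-one-annotator-out subsets and normalizes answer strings, which is omitted here. Function name and toy answers are illustrative.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: full credit if >= 3 of the (typically 10)
    annotators gave the predicted answer, partial credit otherwise."""
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# 2 of 10 annotators agree -> 2/3 credit
print(vqa_accuracy("cat", ["cat", "cat", "dog"] + ["bird"] * 7))
```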
Hardware Specification | Yes | In our experiments, we implement the PDGH model on a single RTX 3090 GPU with PyTorch.
Software Dependencies | No | In our experiments, we implement the PDGH model on a single RTX 3090 GPU with PyTorch. ... we use KeyBERT (Grootendorst 2020) to extract keywords from the query, which can be formulated as follows: $\{K_i = (W^k_{i,1}, W^k_{i,2}, \dots, W^k_{i,L_k})\}_{i=1}^{N_k} = \mathrm{KeyBERT}(Q)$, (4) where $Q$ represents the query text, $K_i$ is the $i$-th key phrase, $L_k$ is the length of key phrases, and $N_k$ is the number of extracted key phrases. Next, we use the extracted keywords to guide the generation of image captions, which can be achieved by integrating the keywords into the input of MLLMs such as CogVLM2 (Wang et al. 2023)
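The quoted pipeline extracts keywords from the question and injects them into the caption prompt of an MLLM. A minimal sketch of that two-step flow is below; the frequency-based keyword ranking is a dependency-free stand-in for KeyBERT's embedding-based ranking, and the stopword list, function names, and prompt template are illustrative assumptions, not the paper's implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "is", "what", "in", "of", "on", "to", "this"}

def extract_keywords(query: str, top_n: int = 3) -> list[str]:
    """Frequency-based stand-in for KeyBERT: rank non-stopword tokens
    by count and keep the top_n (N_k in the paper's notation)."""
    words = [w for w in re.findall(r"[a-z]+", query.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

def caption_prompt(keywords: list[str]) -> str:
    """Fold keywords into the MLLM input to steer caption generation."""
    return f"Describe the image, focusing on: {', '.join(keywords)}."

kws = extract_keywords("What color is the umbrella next to the bench?")
print(caption_prompt(kws))
```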
Experiment Setup | Yes | The AdamW optimizer is used with a weight decay of 0.001, a learning rate of 0.001, and a batch size of 512.
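For reference, the quoted hyperparameters map onto AdamW's decoupled update rule (Loshchilov & Hutter), in which weight decay is applied directly to the parameter rather than folded into the gradient. A minimal pure-Python sketch of one step, using the paper's lr = 0.001 and weight_decay = 0.001 as defaults (beta/eps values are the common PyTorch defaults, an assumption here):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update: Adam moment estimates plus decoupled weight
    decay (lr * weight_decay * theta subtracted from the parameter)."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g          # first moment
        vi = beta2 * vi + (1 - beta2) * g * g      # second moment
        m_hat = mi / (1 - beta1 ** t)              # bias correction
        v_hat = vi / (1 - beta2 ** t)
        th = th - lr * (m_hat / (math.sqrt(v_hat) + eps)
                        + weight_decay * th)       # decoupled decay
        new_theta.append(th); new_m.append(mi); new_v.append(vi)
    return new_theta, new_m, new_v

theta, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], t=1)
print(theta)
```

In training code this corresponds to `torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.001)`.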