Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering

Authors: Ting Yu, Zixuan Tong, Jun Yu, Ke Zhang

AAAI 2025

Reproducibility assessment. Each entry below lists the variable, the result, and the supporting response/quote:
Research Type: Experimental. "Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative Med VQA." (Experiment Setup: "Datasets and Evaluation Metrics. We fine-tune and evaluate FAVP on three Medical VQA datasets, i.e., VQA-RAD, SLAKE, and DME.")
Researcher Affiliation: Academia. "1 School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China (EMAIL, EMAIL, EMAIL, EMAIL); 2 School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen), China; 3 Key Laboratory of Complex Systems Modeling and Simulation, Hangzhou Dianzi University, Hangzhou, China."
Pseudocode: No. The paper describes the methodology in narrative text and uses diagrams (e.g., Figure 2) but does not include any explicitly structured pseudocode or algorithm blocks.
Open Source Code: Yes. https://github.com/OpenMICG/FAVP
Open Datasets: Yes. "Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative Med VQA." Stage 1: "To achieve cross-modal alignment between medical images and text, we utilize the radiology part of the ROCO dataset (Pelka et al. 2018)." Stage 2: "To train our proprietary VQA model, we utilize PMC-VQA (Zhang et al. 2023c), a large-scale dataset that encompasses a broad range of modalities and diseases." Stage 3: "We train and evaluate our model on three downstream Med VQA datasets, i.e., VQA-RAD (Lau et al. 2018), SLAKE (Liu et al. 2021), and DME (Tascon-Morales, Márquez-Neila, and Sznitman 2022)."
Dataset Splits: No. The paper mentions evaluating on VQA-RAD, SLAKE, and DME and states "Following LLaVA-Med (Li et al. 2024), for closed-set questions, we report the accuracy... For open-set questions, we use recall... For the DME dataset, we report overall accuracy and consistency metrics." However, it does not provide specific percentages or counts for the training, validation, and test splits used for these datasets, nor does it explicitly reference predefined splits with citations for reproducibility.
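The metrics quoted above can be sketched as follows. This is an assumption of the common LLaVA-Med-style protocol (exact-match accuracy for closed-set questions, token-level recall for open-set questions), not the authors' released evaluation scripts:

```python
def closed_set_accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answer (case-insensitive)."""
    assert len(preds) == len(golds)
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / len(golds)

def open_set_recall(pred, gold):
    """Fraction of gold-answer tokens that appear in the generated answer."""
    gold_tokens = gold.strip().lower().split()
    pred_tokens = set(pred.strip().lower().split())
    if not gold_tokens:
        return 0.0
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)
```

For example, a generated answer "left lung opacity" against the ground truth "left lung" would score a recall of 1.0 under this scheme, since both gold tokens are recovered.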
Hardware Specification: Yes. "We conduct all experiments on GeForce RTX 4090 GPUs."
Software Dependencies: No. The paper mentions specific models such as "ViT-G/14" and "Vicuna 7B" and techniques such as "LoRA", but does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are essential for full reproducibility.
Experiment Setup: Yes. "Based on preliminary experiments, we establish the LoRA rank of the ViT at 4 and that of the LLM at 8, and we also discuss them in ablation studies. The trainable components of FAVP consist of a Hierarchical Extractor with 108M parameters and LoRA layers with 5M parameters, resulting in a total of 113M activation parameters. During training, we employ the AdamW optimizer with a learning rate of 1e-4, following a cosine learning rate schedule. The values of β1 and β2 are set to 0.9 and 0.999, respectively. To enhance model generalization and mitigate overfitting, we apply a weight decay of 0.05. In HAG, images are resized to 224 × 224 to align with the encoder. Figure 4 investigates the impact of the hyperparameters, specifically the number of keypoints and the NMS threshold τ, on the accuracy of answer generation. Empirically, we select keypoints spanning from 30 to 60, while τ ranges cover [0.75, 0.80, 0.85, 0.90]."
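The reported hyperparameters can be collected into a minimal configuration sketch. The key names below are our own and do not come from the released code; the cosine schedule is a standard decay from the base learning rate to zero, which matches the description but may differ from the authors' exact warmup/decay implementation:

```python
import math

# Hyperparameters as reported in the experiment setup (key names are assumptions).
CONFIG = {
    "lora_rank_vit": 4,       # LoRA rank for the ViT encoder
    "lora_rank_llm": 8,       # LoRA rank for the LLM
    "lr": 1e-4,               # AdamW base learning rate
    "betas": (0.9, 0.999),    # AdamW beta1, beta2
    "weight_decay": 0.05,
    "image_size": (224, 224), # input resolution for the encoder
}

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Standard cosine learning-rate schedule: decays base_lr to min_lr."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under this schedule the learning rate starts at 1e-4, reaches half that value at the midpoint of training, and decays to zero by the final step.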