Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering
Authors: Ting Yu, Zixuan Tong, Jun Yu, Ke Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative Med VQA. Datasets and Evaluation Metrics: We fine-tune and evaluate FAVP on three Medical VQA datasets, i.e., VQA-RAD, SLAKE, and DME. |
| Researcher Affiliation | Academia | 1School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China EMAIL, EMAIL, EMAIL, EMAIL 2School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen), China 3Key Laboratory of Complex Systems Modeling and Simulation, Hangzhou Dianzi University, Hangzhou, China |
| Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams (e.g., Figure 2) but does not include any explicitly structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/OpenMICG/FAVP |
| Open Datasets | Yes | Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative Med VQA. Stage 3. We train and evaluate our model on three downstream Med VQA datasets, i.e., VQA-RAD (Lau et al. 2018), SLAKE (Liu et al. 2021), and DME (Tascon-Morales, Márquez-Neila, and Sznitman 2022). Stage 1. To achieve cross-modal alignment between medical images and text, we utilize the radiology part of the ROCO dataset (Pelka et al. 2018). To train our proprietary VQA model, we utilize PMC-VQA (Zhang et al. 2023c) in stage 2, a large-scale dataset that encompasses a broad range of modalities and diseases. |
| Dataset Splits | No | The paper mentions evaluating on VQA-RAD, SLAKE, and DME datasets and states 'Following LLaVA-Med (Li et al. 2024), for closed-set questions, we report the accuracy... For open-set questions, we use recall... For the DME dataset, we report overall accuracy and consistency metrics.' However, it does not provide specific percentages or counts for training, validation, and test splits used for these datasets, nor does it explicitly reference predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | We conduct all experiments on GeForce RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like 'ViT-G/14' and 'Vicuna 7B' and techniques like 'LoRA', but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are essential for full reproducibility. |
| Experiment Setup | Yes | Based on preliminary experiments, we establish the LoRA rank of ViT at 4 and that of the LLM at 8, and we also discuss them in ablation studies. The trainable components of the FAVP consist of a Hierarchical Extractor with 108M parameters and LoRA layers with 5M parameters, resulting in a total of 113M activation parameters. During training, we employ the AdamW optimizer with a learning rate of 1e-4, following a cosine learning rate schedule. The values of β1 and β2 are set to 0.9 and 0.999, respectively. To enhance model generalization and mitigate overfitting, we apply a weight decay of 0.05. In HAG, images are resized to 224×224 to align with the encoder. Figure 4 investigates the impact of the hyperparameters, specifically the number of keypoints and the NMS threshold τ on the accuracy of answer generation. Empirically, we select keypoints spanning from 30 to 60, while τ ranges cover [0.75, 0.80, 0.85, 0.90]. |
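The optimizer settings quoted above (AdamW, lr 1e-4, β1=0.9, β2=0.999, weight decay 0.05, cosine schedule) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the tiny linear `model`, the dummy loss, and the `T_max` step count are all placeholder assumptions, since the paper does not report total training steps.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder stand-in for the paper's 113M trainable parameters
# (Hierarchical Extractor + LoRA layers); a tiny layer suffices to
# demonstrate the reported optimizer configuration.
model = torch.nn.Linear(16, 16)

# Hyperparameters as reported in the paper.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.05,
)
# Cosine learning-rate schedule; T_max is an assumed step budget.
scheduler = CosineAnnealingLR(optimizer, T_max=1000)

# A few dummy training steps to show the optimizer/scheduler loop.
for step in range(10):
    loss = model(torch.randn(4, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

lr = scheduler.get_last_lr()[0]  # decays from 1e-4 along the cosine curve
```

Under this schedule the learning rate stays at 1e-4 at step 0 and decays smoothly toward 0 by step `T_max`, matching the described cosine behavior.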