Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Authors: Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, Yehui Yang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual-language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 points respectively on the mDice metric.
Researcher Affiliation | Collaboration | 1. Baidu Inc; 2. China Agricultural University; 3. Institute of Automation, Chinese Academy of Sciences; 4. Peking University
Pseudocode | No | The paper describes the method's architecture and multi-stage training verbally and with accompanying diagrams (Figure 2 and Figure 3), but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code: https://github.com/ShawnHuang497/MedPLIB
Open Datasets | Yes | To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. ... Open-source. The data, codes, and model checkpoints will be released to the research community. ... We utilize the union of MeCoVQA-R, MeCoVQA-C, SLAKE (Liu et al. 2021), PathVQA (He 2021), PMC-VQA (Zhang et al. 2023), ImageCLEF2021 (Ben Abacha et al. 2021), ImageCLEF2019 (Abacha et al. 2019), and VQA-RAD (Lau et al. 2018) in stage II.
Dataset Splits | Yes | Overall, the training set sizes for MeCoVQA-C, MeCoVQA-R, and MeCoVQA-G are 80k, 126k, and 100k respectively. Additionally, the sizes of their corresponding test sets are 1477, 2633, and 2344, respectively. ... The data volumes for stages I to IV are 330k, 400k, 100k, and 500k, respectively. ... For testing, we extracted 400 samples from the original MeCoVQA-C and MeCoVQA-R test sets and 2000 samples from OmniMedVQA (Hu et al. 2024).
Hardware Specification | No | The paper mentions training durations for different stages (e.g., '9, 17, 15, and 77 hours, respectively') but does not specify any hardware details such as GPU models, CPU types, or memory amounts used for these experiments.
Software Dependencies | No | The paper references various models and frameworks such as 'SAM-Med2D (Cheng et al. 2023)', 'LLaMA-7B (Touvron et al. 2023)', 'CLIP-Large (Radford et al. 2021)', and the 'GELU activation function (Hendrycks and Gimpel 2016)'. However, it does not provide specific software environment details such as programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | Model Settings. We employ SAM-Med2D (Cheng et al. 2023) as the pixel encoder and mask decoder. We use LLaMA-7B (Touvron et al. 2023) as the base LLM. Following LLaVA-1.5 (Liu et al. 2024), we utilize CLIP-Large (Radford et al. 2021) as the vision tower, and the MLP consists of two linear layers with the GELU activation function (Hendrycks and Gimpel 2016). The model with 2 experts has 12 billion parameters. The training durations for stages I to IV are 9, 17, 15, and 77 hours, respectively. ... The optimization objective can be formulated as: L = λreg·Lreg + λbce·Lbce + λdice·Ldice (5), where λreg, λbce, and λdice are hyperparameters that balance the different objectives. ... we unfreeze all parameters, employing LoRA for fine-tuning through expert mixing. ... Empirical evidence suggests CF=1.5 is optimal, balancing the reduction of information loss and noise in token distribution.
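The weighted objective quoted above (Eq. 5) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the λ values are placeholder assumptions (the paper does not report them here), Lreg is taken as a text-generation loss computed elsewhere and passed in as a scalar, and Lbce/Ldice are the standard binary cross-entropy and soft Dice losses over predicted segmentation masks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(logits, target, eps=1e-8):
    """Binary cross-entropy over mask logits, averaged over all pixels."""
    p = sigmoid(logits)
    return float(-np.mean(target * np.log(p + eps)
                          + (1.0 - target) * np.log(1.0 - p + eps)))

def dice_loss(logits, target, smooth=1.0):
    """Soft Dice loss: 1 - mean per-sample Dice coefficient over (N, H, W) masks."""
    p = sigmoid(logits).reshape(logits.shape[0], -1)
    t = target.reshape(target.shape[0], -1)
    inter = (p * t).sum(axis=-1)
    union = p.sum(axis=-1) + t.sum(axis=-1)
    return float(np.mean(1.0 - (2.0 * inter + smooth) / (union + smooth)))

def combined_loss(l_reg, mask_logits, mask_target,
                  lam_reg=1.0, lam_bce=2.0, lam_dice=0.5):
    """L = λreg·Lreg + λbce·Lbce + λdice·Ldice (λ values are illustrative)."""
    return (lam_reg * l_reg
            + lam_bce * bce_loss(mask_logits, mask_target)
            + lam_dice * dice_loss(mask_logits, mask_target))
```

The same Dice coefficient, averaged over a test set instead of subtracted from 1, is the mDice metric on which the zero-shot pixel-grounding comparison above is reported.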