Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Authors: Lehan Wang, Haonan Wang, Honglong Yang, Jiaji Mao, Zehong Yang, Jun Shen, Xiaomeng Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our model can not only accomplish powerful performance across various medical vision-language tasks in bilingual settings, but also recognize and detect structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs.
Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology. 2Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University. Corresponding to Xiaomeng Li (EMAIL).
Pseudocode | Yes | Algorithm 1 Region-Aligned Evaluation
Open Source Code | No | Our project page is https://medrega.github.io. The paper provides a project page URL, which typically serves as a demonstration or overview, rather than explicitly stating it hosts the source code for the described methodology.
Open Datasets | Yes | We first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct...Combining our collected dataset with other medical multimodal corpora for training...MIMIC-CXR dataset (Johnson et al., 2019), and our in-house clinical data...The Region-Text dataset is sourced from SA-Med2D-20M (Ye et al., 2023)...
Dataset Splits | Yes | For the MIMIC-CXR dataset, we follow previous works (Wu et al., 2023) to utilize both frontal and lateral images...For our in-house dataset, we extract central slices from each 3D scan to formulate the 2D inputs...Following the official split, we use 45,000 samples for training. For single-label classification, MedRegA outperforms existing models by a large margin from 15.32% to 30.98%.
Hardware Specification | Yes | The model is trained on 16 NVIDIA H800 GPUs for 1 epoch in the alignment stage and 2 epochs in the instruction tuning stage.
Software Dependencies | Yes | We employ InternVL 1.2 (Chen et al., 2024b) as our general-domain foundation to begin training, which is composed of InternViT-6B as the vision encoder, and Nous-Hermes-2-Yi-34B as the language model...We follow the official instruction for finetuning InternVL, and leverage LoRA with DeepSpeed ZeRO Stage 3 to optimize model parameters.
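The LoRA dependency mentioned above can be illustrated with a minimal, self-contained sketch: the base weight is frozen and only a low-rank update is trained. This is a generic pure-PyTorch illustration of the technique, not the paper's actual InternVL finetuning code; the class name `LoRALinear` and the rank/alpha values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x).  Only A and B receive gradients."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # Low-rank factors: A is (r, in), B is (out, r); B starts at zero
        # so the adapted layer initially matches the base layer.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(64, 64), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 512 4672: only the rank-4 factors are trainable
```

In practice this wiring is handled by an adapter library rather than written by hand, and DeepSpeed ZeRO Stage 3 additionally shards the (frozen and trainable) parameters across the 16 GPUs; the sketch only shows why LoRA keeps the trainable parameter count small.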
Experiment Setup | Yes | Our training process is divided into two steps: alignment training and instruction tuning. During the alignment training phase, we freeze the vision encoder and language model, only fine-tuning the alignment module with medical image captioning datasets...In the instruction tuning stage, we apply both public datasets and our Region-Centric datasets, MedRegInstruct, to optimize the language model, while keeping the other components unchanged. The language model loss is applied as the loss function...The model is trained on 16 NVIDIA H800 GPUs for 1 epoch in the alignment stage and 2 epochs in the instruction tuning stage.
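The two-stage schedule described above (train only the alignment module, then only the language model) comes down to toggling `requires_grad` per component. A minimal sketch, assuming toy stand-in modules with illustrative sizes; the component names and the `set_stage` helper are assumptions for illustration, not the paper's code:

```python
import torch.nn as nn

# Toy stand-ins for the three components of the MLLM (sizes are illustrative).
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(32, 16),
    "alignment_module": nn.Linear(16, 16),
    "language_model": nn.Linear(16, 8),
})

def set_stage(model: nn.ModuleDict, stage: str) -> None:
    """Stage 'align': only the alignment module trains (vision encoder and
    language model frozen).  Stage 'instruct': only the language model trains."""
    trainable = {"align": "alignment_module", "instruct": "language_model"}[stage]
    for name, module in model.items():
        for p in module.parameters():
            p.requires_grad = (name == trainable)

set_stage(model, "align")
assert not any(p.requires_grad for p in model["vision_encoder"].parameters())
assert all(p.requires_grad for p in model["alignment_module"].parameters())

set_stage(model, "instruct")
assert all(p.requires_grad for p in model["language_model"].parameters())
assert not any(p.requires_grad for p in model["alignment_module"].parameters())
```

An optimizer built for each stage would then only be given `p for p in model.parameters() if p.requires_grad`, so the frozen components receive neither gradients nor updates.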