LLM-RG4: Flexible and Factual Radiology Report Generation Across Diverse Input Contexts
Authors: Zhuhao Wang, Yihua Sun, Zihan Li, Xuan Yang, Fang Chen, Hongen Liao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem. |
| Researcher Affiliation | Academia | Zhuhao Wang1, Yihua Sun1, Zihan Li1, Xuan Yang1, Fang Chen2, Hongen Liao1,2* 1School of Biomedical Engineering, Tsinghua University, Beijing, China 2School of Biomedical Engineering, and Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Detailed Procedure of Token Weight C. Input: report T = [t_1, t_2, …, t_L], CheXbert classifier f_c. Output: C = [c_1, c_2, …, c_L]. 1: initialize c_i = 1; 2: get Y = [y_1, y_2, …, y_13] = f_c(T); 3: for y_j in Y: 4: if y_j = 1 or −1 then 5: c′_i = IG_i(x), 6: c_i = max(c_i, c′_i); 7: define g_k = 1; 8: split C into M sentences C_s = [c^1, c^2, …, c^M], where c^n is the n-th sentence's weights with length L_n, c^n = [c^n_1, c^n_2, …, c^n_{L_n}]; 9: if c^n_i > threshold then 10: c^n = λ, with λ > 1; 11: else 12: c^n = 1; 13: end if; 14: return C |
| Open Source Code | Yes | Code https://github.com/zh-Wang-Med/LLM-RG4 |
| Open Datasets | Yes | We utilize the MIMIC-CXR dataset (Johnson et al. 2019), which is the only publicly available dataset that encompasses both multi-view and longitudinal information, to generate the MIMIC-RG4 dataset. |
| Dataset Splits | Yes | Table 1: Percentage (%) of reports, in the single-image no-longitudinal setting, that encompass various categories of information. PC: Prior Comparison; PP: Prior Procedure; Comm: Communication; Tr: train; Ts: test. Tr/172.6K: 0.30, 0.30, 0.12, 0.00; Val/1.4K: 0.07, 0.07, 0.14, 0.00; Ts/2.4K: 0.42, 0.42, 0.04, 0.04 |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or processing speeds used for running the experiments. |
| Software Dependencies | Yes | We adopt RAD-DINO (Pérez-García et al. 2024) as the image encoder and BiomedVLP-CXRBERT (Boecking et al. 2022) as the text encoder, with Vicuna 7B v1.5 (Chiang et al. 2023) as the text decoder. |
| Experiment Setup | Yes | The number of learnable variable tokens in the perceiver is set to 128, threshold is set to 0.4 and λ is set to 1.75. Following LLaVA (Liu et al. 2024b), we employ a two-stage training strategy. Initially, we only train the ATF with sn data to achieve modality alignment. Subsequently, we conduct instruction tuning on the MIMIC-RG4 dataset, training the ATF, and applying LoRA (Hu et al. 2021) for fine-tuning Vicuna. |
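The sentence-level weighting step of Algorithm 1 can be sketched as follows. This is a minimal, hypothetical illustration: the paper's CheXbert classifier and integrated-gradients attribution are replaced by precomputed per-token saliency scores (`ig_scores`), and only the thresholding/λ-scaling step (with the paper's reported threshold 0.4 and λ = 1.75) is shown. The function name and input layout are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Algorithm 1's sentence-level token weighting.
# `ig_scores` stands in for the per-token saliency values the paper
# obtains via CheXbert + integrated gradients; here they are assumed
# to be precomputed and already grouped by sentence.
from typing import List


def token_weights(
    ig_scores: List[List[float]],
    threshold: float = 0.4,   # paper's reported threshold
    lam: float = 1.75,        # paper's reported lambda (> 1)
) -> List[float]:
    """Return one weight per token: every token of a sentence whose
    maximum saliency exceeds `threshold` gets weight `lam`; all other
    tokens keep weight 1.0."""
    weights: List[float] = []
    for sentence in ig_scores:
        if any(score > threshold for score in sentence):
            weights.extend([lam] * len(sentence))
        else:
            weights.extend([1.0] * len(sentence))
    return weights
```

For example, `token_weights([[0.1, 0.5], [0.2, 0.1]])` up-weights both tokens of the first sentence (its max saliency 0.5 exceeds 0.4) and leaves the second sentence at 1.0, yielding `[1.75, 1.75, 1.0, 1.0]`.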