LLM-RG4: Flexible and Factual Radiology Report Generation Across Diverse Input Contexts
Authors: Zhuhao Wang, Yihua Sun, Zihan Li, Xuan Yang, Fang Chen, Hongen Liao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem. |
| Researcher Affiliation | Academia | Zhuhao Wang1, Yihua Sun1, Zihan Li1, Xuan Yang1, Fang Chen2, Hongen Liao1,2* 1School of Biomedical Engineering, Tsinghua University, Beijing, China 2School of Biomedical Engineering, and Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Detailed Procedure of Token Weight C. Input: report T = [t_1, t_2, …, t_L], CheXbert classifier f_c. Output: C = [c_1, c_2, …, c_L]. 1: initialize c_i = 1; 2: get Y = [y_1, y_2, …, y_13] = f_c(T); 3: for y_j in Y: 4: if y_j = 1 or −1 then 5: c′_i = IG_i(x), 6: c_i = max(c_i, c′_i); 7: define g_k = 1; 8: split C into M sentences C_s = [c^1, c^2, …, c^M], where c^n is the n-th sentence's weights with length L_n, c^n = [c^n_1, c^n_2, …, c^n_{L_n}]; 9: if c^n_i > threshold then 10: c^n = λ, with λ > 1; 11: else 12: c^n = 1; 13: end if; 14: return C |
| Open Source Code | Yes | Code https://github.com/zh-Wang-Med/LLM-RG4 |
| Open Datasets | Yes | We utilize the MIMIC-CXR dataset (Johnson et al. 2019), which is the only publicly available dataset that encompasses both multi-view and longitudinal information, to generate the MIMIC-RG4 dataset. |
| Dataset Splits | Yes | Table 1: Percentage (%) of reports, in the single-image no-longitudinal setting, that encompass various categories of information. PC: Prior Comparison; PP: Prior Procedure; Comm: Communication; Tr: train; Ts: test. Tr/172.6K: 0.30, 0.30, 0.12, 0.00; Val/1.4K: 0.07, 0.07, 0.14, 0.00; Ts/2.4K: 0.42, 0.42, 0.04, 0.04 |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or processing speeds used for running the experiments. |
| Software Dependencies | Yes | We adopt RAD-DINO (Pérez-García et al. 2024) as the image encoder and BiomedVLP-CXRBERT (Boecking et al. 2022) as the text encoder, with Vicuna 7B v1.5 (Chiang et al. 2023) as the text decoder. |
| Experiment Setup | Yes | The number of learnable variable tokens in the perceiver is set to 128, threshold is set to 0.4 and λ is set to 1.75. Following LLaVA (Liu et al. 2024b), we employ a two-stage training strategy. Initially, we only train the ATF with sn data to achieve modality alignment. Subsequently, we conduct instruction tuning on the MIMIC-RG4 dataset, training the ATF, and applying LoRA (Hu et al. 2021) for fine-tuning Vicuna. |
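The sentence-level weighting step of Algorithm 1 can be sketched as follows. This is a minimal, hypothetical illustration: the paper's CheXbert classifier and integrated-gradients attribution are replaced by precomputed per-token saliency scores (`ig_scores`), and only the thresholding/λ-scaling step (with the paper's reported threshold 0.4 and λ = 1.75) is shown. The function name and input layout are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Algorithm 1's sentence-level token weighting.
# `ig_scores` stands in for the per-token saliency values the paper
# obtains via CheXbert + integrated gradients; here they are assumed
# to be precomputed and already grouped by sentence.
from typing import List


def token_weights(
    ig_scores: List[List[float]],
    threshold: float = 0.4,   # paper's reported threshold
    lam: float = 1.75,        # paper's reported lambda (> 1)
) -> List[float]:
    """Return one weight per token: every token of a sentence whose
    maximum saliency exceeds `threshold` gets weight `lam`; all other
    tokens keep weight 1.0."""
    weights: List[float] = []
    for sentence in ig_scores:
        if any(score > threshold for score in sentence):
            weights.extend([lam] * len(sentence))
        else:
            weights.extend([1.0] * len(sentence))
    return weights
```

For example, `token_weights([[0.1, 0.5], [0.2, 0.1]])` up-weights both tokens of the first sentence (its max saliency 0.5 exceeds 0.4) and leaves the second sentence at 1.0, yielding `[1.75, 1.75, 1.0, 1.0]`.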