Modalities Contribute Unequally: Enhancing Medical Multi-modal Learning through Adaptive Modality Token Re-balancing

Authors: Jie Peng, Jenna L. Ballard, Mohan Zhang, Sukwon Yun, Jiayi Xin, Qi Long, Yanyong Zhang, Tianlong Chen

ICML 2025

Reproducibility Assessment (each item lists the variable, the result, and the supporting LLM response):
Research Type: Experimental. Comprehensive experiments on both medical and general multi-modal datasets demonstrate the effectiveness and generalizability of AMC. We demonstrate the effectiveness of AMC through extensive experiments on several real-world datasets, including the MIMIC-IV dataset, the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and a subset of the TCGA benchmark covering five different cancer types.
Researcher Affiliation: Academia. (1) University of Science and Technology of China, (2) University of Pennsylvania, (3) University of North Carolina at Chapel Hill. Correspondence to: Yanyong Zhang <EMAIL>.
Pseudocode: No. The paper describes the operations and steps of AMC within the main text (e.g., Section 4.2 Modality Importance Calculation, Section 4.3 Customized Token Fusion) and through figures (e.g., Figures 2 and 3), but it does not include a distinct, labeled pseudocode or algorithm block.
Open Source Code: Yes. Code is available at https://github.com/PengJieb/amc.
Open Datasets: Yes. We demonstrate the effectiveness of AMC through extensive experiments on several real-world datasets, including the MIMIC-IV dataset, the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and a subset of the TCGA benchmark covering five different cancer types. We select the Enhanced Rico (ENRICO) dataset (Leiva et al., 2021) to evaluate the generalizability of AMC.
Dataset Splits: Yes. For the dataset split, we use 70% for training, 15% for validation, and 15% for testing.
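The quoted 70/15/15 split can be sketched as follows. This is a minimal illustration only: the shuffling, seeding, and any stratification are assumptions, since the paper's text does not specify how the split is drawn.

```python
import random

def split_dataset(indices, seed=0):
    """Shuffle sample indices and split 70% / 15% / 15% into
    train / validation / test, matching the ratios quoted above.
    The shuffle and fixed seed are illustrative assumptions."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n = len(idx)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# Example: 1000 samples -> 700 / 150 / 150 disjoint subsets.
train, val, test = split_dataset(range(1000))
```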
Hardware Specification: Yes. All experiments were conducted using RTX 3090 GPUs.
Software Dependencies: No. The paper describes the implementation and experimental setup but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup: Yes. To ensure a fair comparison with baselines, we use the best hyper-parameter settings from the original papers. If these are not available, we conduct hyper-parameter searches over learning rate, hidden dimension, and batch size, with ranges of [1e-3, 1e-4, 5e-5, 1e-5], [32, 64, 128], and [32, 64, 128], respectively. For our proposed method, we additionally search the number of experts and the weights of L_I, L_T, and the load-balancing loss of SMoE, with ranges of [4, 8, 16], [1.0, 0.1], [1.0, 0.1], and [1.0, 0.1], respectively. The final hyper-parameter settings for AMC are in Appendix B.1 (Table 7):

Table 7. The hyper-parameter setup for AMC. UCEC, LUAD, LGG, BRCA, and BLCA are the five TCGA cancer types.

Hyper-parameter           ADNI   MIMIC-IV  UCEC   LUAD   LGG    BRCA   BLCA   ENRICO
Learning rate             1e-4   1e-3      1e-3   1e-3   1e-3   1e-3   1e-3   5e-3
# of Experts              8      8         8      8      8      8      8      8
Top-K                     2      2         2      2      2      2      2      2
# of Transformer Layers   2      2         2      2      2      2      2      4
Training Epochs           30     100       30     30     30     30     30     100
Warm-up Epochs            5      10        5      5      5      5      5      5
Hidden dimension          64     64        128    64     64     64     64     128
Batch Size                32     64        64     64     64     64     64     128
# of Attention Heads      8      8         8      8      8      8      8      8
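The quoted search ranges can be enumerated as a simple grid. The sketch below only builds the candidate configurations; the training and evaluation loop, and every identifier name, are hypothetical illustrations, not taken from the released code.

```python
from itertools import product

# Search ranges quoted in the setup above.
LEARNING_RATES = [1e-3, 1e-4, 5e-5, 1e-5]
HIDDEN_DIMS = [32, 64, 128]
BATCH_SIZES = [32, 64, 128]
NUM_EXPERTS = [4, 8, 16]
# The weights of L_I, L_T, and the SMoE load-balancing loss each
# range over [1.0, 0.1] and are searched independently.
LOSS_WEIGHTS = [1.0, 0.1]

def candidate_configs():
    """Yield one dict per point on the hyper-parameter grid."""
    for lr, hd, bs, ne, w_li, w_lt, w_lb in product(
        LEARNING_RATES, HIDDEN_DIMS, BATCH_SIZES, NUM_EXPERTS,
        LOSS_WEIGHTS, LOSS_WEIGHTS, LOSS_WEIGHTS,
    ):
        yield {
            "lr": lr,
            "hidden_dim": hd,
            "batch_size": bs,
            "num_experts": ne,
            "w_li": w_li,
            "w_lt": w_lt,
            "w_load_balance": w_lb,
        }

configs = list(candidate_configs())
# 4 * 3 * 3 * 3 * 2 * 2 * 2 = 864 candidate configurations;
# each would be trained and scored on the validation split.
```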