I²MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Authors: Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation on medical and general multimodal datasets shows that I²MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Extensive experiments on five diverse real-world multimodal datasets validate the efficacy of I²MoE, showcasing significant performance improvements (up to 5.5% in accuracy) and interpretability benefits over vanilla modality fusion methods. |
| Researcher Affiliation | Academia | ¹University of Pennsylvania, PA, USA; ²University of North Carolina at Chapel Hill, NC, USA; ³University of Science and Technology of China, Anhui, China. Correspondence to: Qi Long <EMAIL>, Tianlong Chen <EMAIL>, Jiayi Xin <EMAIL>. |
| Pseudocode | Yes | We present the training and inference pipeline of I²MoE in Algorithm 1. The complete learning objective is provided in Appendix C. |
| Open Source Code | Yes | Code is available at https://github.com/Raina-Xin/I2MoE. |
| Open Datasets | Yes | We evaluate our method on five multimodal datasets, using all available modalities while discarding samples with missing data. Two medical multimodal datasets: ADNI (Weiner et al., 2010; 2017) consists of 2,380 samples for Alzheimer's Disease classification (Dementia, Cognitively Normal, or Mild Cognitive Impairment). It includes four modalities: Image (I), Genetic (G), Clinical (C), and Biospecimen (B). MIMIC-IV (Johnson et al., 2023) is a critical care dataset with 9,003 patient records for one-year mortality prediction (binary classification), utilizing three modalities: Lab (L), Notes (N), and Code (C). Three general multimodal datasets: IMDB (Arevalo et al., 2017) includes 25,959 movies for multi-label genre classification across 23 genres, leveraging Image (I) and Language (L) modalities. MOSI (Zadeh et al., 2016) comprises 2,199 annotated YouTube clips for sentiment analysis (regression with scores in [-3, 3], then mapped to binary classification), incorporating Vision (V), Audio (A), and Text (T) modalities. ENRICO (Leiva et al., 2020) contains 1,460 Android app screens for UI design classification into 20 categories, featuring two modalities: Screenshot (S) and Wireframe (W). Detailed dataset preprocessing is provided in Appendix E. |
| Dataset Splits | Yes | The dataset is partitioned into training, validation, and testing sets, with 70% allocated for training, 15% for validation, and the remaining 15% for testing. |
| Hardware Specification | Yes | All experiments were run on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers. |
| Experiment Setup | Yes | To improve reproducibility, the tables below provide a summary of the hyperparameters used in our experiments. For hyperparameters of other baseline fusion methods, please refer to the scripts in the GitHub repository at https://github.com/Raina-Xin/I2MoE/tree/main/scripts/train_scripts. |

Table 7. Hyperparameter Configuration for I²MoE-MulT on Different Datasets.

| Hyperparameter | ADNI | MIMIC | IMDB | MOSI | ENRICO |
|---|---|---|---|---|---|
| Learning Rate (lr) | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| Temperature for Reweighting (temperature_rw) | 1 | 2 | 2.0 | 2.0 | 2.0 |
| Hidden Dimension for Reweighting (hidden_dim_rw) | 256 | 128 | 256 | 256 | 256 |
| Number of Layers in Reweighting (num_layer_rw) | 2 | 2 | 3 | 3 | 3 |
| Interaction Loss Weight (interaction_loss_weight) | 0.5 | 0.01 | 0.5 | 0.005 | 0.5 |
| Modality (modality) | IGCB | LNC | LI | TVA | SW |
| Training Epochs (train_epochs) | 50 | 30 | 40 | 30 | 50 |
| Batch Size (batch_size) | 32 | 32 | 32 | 32 | 32 |
| Number of Experts (num_experts) | 8 | 4 | 4 | 4 | 4 |
| Number of Layers in Encoder (num_layers_enc) | 1 | 1 | 1 | 1 | 2 |
| Number of Layers in Fusion (num_layers_fus) | 2 | 2 | 2 | 1 | 2 |
| Number of Layers in Prediction (num_layers_pred) | 2 | 2 | 2 | 1 | 2 |
| Number of Attention Heads (num_heads) | 4 | 1 | 4 | 1 | 4 |
| Hidden Dimension (hidden_dim) | 256 | 128 | 256 | 256 | 256 |
| Number of Patches (num_patches) | 16 | 8 | 4 | 4 | 8 |
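The 70/15/15 train/validation/test partition reported under Dataset Splits can be sketched as follows. This is an illustrative reconstruction only: the paper states the ratios but not the random seed or shuffling procedure, so `seed` and the use of NumPy's `permutation` are assumptions.

```python
import numpy as np

def split_indices(n, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle sample indices and split them 70/15/15 into
    train / validation / test sets. Seed and shuffling method
    are assumptions, not taken from the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# e.g. ADNI has 2,380 samples -> 1666 train, 357 val, 357 test
train, val, test = split_indices(2380)
print(len(train), len(val), len(test))  # 1666 357 357
```

The remainder after the train and validation cuts goes to the test set, so every sample is assigned exactly once.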
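Table 7 lists a reweighting temperature (`temperature_rw`) alongside the number of experts (`num_experts`). A common way such a temperature is used in mixture-of-experts gating is to scale logits before a softmax; the sketch below shows that mechanism under this assumption. The exact I²MoE reweighting architecture is defined in the paper's Appendix C and the repository, and is not reproduced here.

```python
import numpy as np

def reweight(logits, temperature=2.0):
    """Temperature-scaled softmax over per-expert scores.
    Assumption: temperature_rw divides the logits before the
    softmax, as is typical; higher temperature flattens the
    resulting expert-weight distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()            # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Four experts (num_experts for MIMIC/IMDB/MOSI/ENRICO).
print(reweight([2.0, 1.0, 0.5, -1.0], temperature=2.0))
```

With `temperature=1` (the ADNI setting in Table 7) the same scores yield a sharper distribution, concentrating weight on the top expert.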