I²MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Authors: Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation on medical and general multimodal datasets shows that I²MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Extensive experiments on five diverse real-world multimodal datasets validate the efficacy of I²MoE, showcasing significant performance improvements (up to 5.5% in accuracy) and interpretability benefits over vanilla modality fusion methods. |
| Researcher Affiliation | Academia | ¹University of Pennsylvania, PA, USA; ²University of North Carolina at Chapel Hill, NC, USA; ³University of Science and Technology of China, Anhui, China. Correspondence to: Qi Long <EMAIL>, Tianlong Chen <EMAIL>, Jiayi Xin <EMAIL>. |
| Pseudocode | Yes | We present the training and inference pipeline of I²MoE in Algorithm 1. The complete learning objective is provided in Appendix C. |
| Open Source Code | Yes | Code is available at https://github.com/Raina-Xin/I2MoE. |
| Open Datasets | Yes | We evaluate our method on five multimodal datasets, using all available modalities while discarding samples with missing data. Two medical multimodal datasets: ADNI (Weiner et al., 2010; 2017) consists of 2,380 samples for Alzheimer's Disease classification (Dementia, Cognitively Normal, or Mild Cognitive Impairment). It includes four modalities: Image (I), Genetic (G), Clinical (C), and Biospecimen (B). MIMIC-IV (Johnson et al., 2023) is a critical care dataset with 9,003 patient records for one-year mortality prediction (binary classification), utilizing three modalities: Lab (L), Notes (N), and Code (C). Three general multimodal datasets: IMDB (Arevalo et al., 2017) includes 25,959 movies for multi-label genre classification across 23 genres, leveraging Image (I) and Language (L) modalities. MOSI (Zadeh et al., 2016) comprises 2,199 annotated YouTube clips for sentiment analysis (regression with scores in [-3, 3], then mapped to binary classification), incorporating Vision (V), Audio (A), and Text (T) modalities. ENRICO (Leiva et al., 2020) contains 1,460 Android app screens for UI design classification into 20 categories, featuring two modalities: Screenshot (S) and Wireframe (W). Detailed dataset preprocessing is provided in Appendix E. |
| Dataset Splits | Yes | The dataset is partitioned into training, validation, and testing sets, with 70% allocated for training, 15% for validation, and the remaining 15% for testing. |
| Hardware Specification | Yes | All experiments were run on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers. |
| Experiment Setup | Yes | To improve reproducibility, the tables below provide a summary of the hyperparameters used in our experiments. For hyperparameters of other baseline fusion methods, please refer to the scripts in the GitHub repository at https://github.com/Raina-Xin/I2MoE/tree/main/scripts/train_scripts. |

Table 7. Hyperparameter Configuration for I²MoE-MulT on Different Datasets.

| Hyperparameter | ADNI | MIMIC | IMDB | MOSI | ENRICO |
|---|---|---|---|---|---|
| Learning Rate (lr) | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| Temperature for Reweighting (temperature_rw) | 1 | 2 | 2.0 | 2.0 | 2.0 |
| Hidden Dimension for Reweighting (hidden_dim_rw) | 256 | 128 | 256 | 256 | 256 |
| Number of Layers in Reweighting (num_layer_rw) | 2 | 2 | 3 | 3 | 3 |
| Interaction Loss Weight (interaction_loss_weight) | 0.5 | 0.01 | 0.5 | 0.005 | 0.5 |
| Modality (modality) | IGCB | LNC | LI | TVA | SW |
| Training Epochs (train_epochs) | 50 | 30 | 40 | 30 | 50 |
| Batch Size (batch_size) | 32 | 32 | 32 | 32 | 32 |
| Number of Experts (num_experts) | 8 | 4 | 4 | 4 | 4 |
| Number of Layers in Encoder (num_layers_enc) | 1 | 1 | 1 | 1 | 2 |
| Number of Layers in Fusion (num_layers_fus) | 2 | 2 | 2 | 1 | 2 |
| Number of Layers in Prediction (num_layers_pred) | 2 | 2 | 2 | 1 | 2 |
| Number of Attention Heads (num_heads) | 4 | 1 | 4 | 1 | 4 |
| Hidden Dimension (hidden_dim) | 256 | 128 | 256 | 256 | 256 |
| Number of Patches (num_patches) | 16 | 8 | 4 | 4 | 8 |
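The 70/15/15 train/validation/test partition reported under Dataset Splits can be sketched as follows. This is an illustrative reconstruction only: the paper states the ratios but not the random seed or shuffling procedure, so `seed` and the use of NumPy's `permutation` are assumptions.

```python
import numpy as np

def split_indices(n, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle sample indices and split them 70/15/15 into
    train / validation / test sets. Seed and shuffling method
    are assumptions, not taken from the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# e.g. ADNI has 2,380 samples -> 1666 train, 357 val, 357 test
train, val, test = split_indices(2380)
print(len(train), len(val), len(test))  # 1666 357 357
```

The remainder after the train and validation cuts goes to the test set, so every sample is assigned exactly once.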
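Table 7 lists a reweighting temperature (`temperature_rw`) alongside the number of experts (`num_experts`). A common way such a temperature is used in mixture-of-experts gating is to scale logits before a softmax; the sketch below shows that mechanism under this assumption. The exact I²MoE reweighting architecture is defined in the paper's Appendix C and the repository, and is not reproduced here.

```python
import numpy as np

def reweight(logits, temperature=2.0):
    """Temperature-scaled softmax over per-expert scores.
    Assumption: temperature_rw divides the logits before the
    softmax, as is typical; higher temperature flattens the
    resulting expert-weight distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()            # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Four experts (num_experts for MIMIC/IMDB/MOSI/ENRICO).
print(reweight([2.0, 1.0, 0.5, -1.0], temperature=2.0))
```

With `temperature=1` (the ADNI setting in Table 7) the same scores yield a sharper distribution, concentrating weight on the top expert.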