Efficient Quantification of Multimodal Interaction at Sample Level

Authors: Zequn Yang, Hongfa Wang, Di Hu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling.
Researcher Affiliation | Collaboration | ¹Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; ²Tencent Data Platform; ³Tsinghua University, Beijing, China; ⁴Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China; ⁵Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, China. Correspondence to: Di Hu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Lightweight Sample-wise Multimodal Interaction Estimation (LSMI) Algorithm
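The report only names the algorithm and does not reproduce it. As an illustrative toy sketch (not the authors' LSMI estimator), the kind of sample-wise interaction quantity such a method targets can be shown with pointwise interaction information on a discrete XOR distribution, where the label is purely synergistic with respect to the two modalities:

```python
import math
from itertools import product

# Toy joint distribution p(x1, x2, y) over binary variables.
# XOR structure (y = x1 XOR x2) makes the label purely synergistic.
p = {}
for x1, x2 in product([0, 1], repeat=2):
    p[(x1, x2, x1 ^ x2)] = 0.25

def marginal(p, keep):
    """Marginalize the joint distribution onto the axes listed in `keep`."""
    m = {}
    for k, v in p.items():
        key = tuple(k[i] for i in keep)
        m[key] = m.get(key, 0.0) + v
    return m

p_x1y = marginal(p, (0, 2))
p_x2y = marginal(p, (1, 2))
p_x12 = marginal(p, (0, 1))
p_x1, p_x2, p_y = marginal(p, (0,)), marginal(p, (1,)), marginal(p, (2,))

def pmi(pxy, px, py, x, y):
    """Pointwise mutual information: log p(x, y) / (p(x) p(y))."""
    return math.log(pxy[(x, y)] / (px[(x,)] * py[(y,)]))

def sample_interaction(x1, x2, y):
    """Pointwise interaction i(x1, x2; y) - i(x1; y) - i(x2; y).
    Positive at a sample indicates synergy; negative indicates redundancy."""
    i_joint = math.log(p[(x1, x2, y)] / (p_x12[(x1, x2)] * p_y[(y,)]))
    return i_joint - pmi(p_x1y, p_x1, p_y, x1, y) - pmi(p_x2y, p_x2, p_y, x2, y)

# Each modality alone is independent of y (pmi terms vanish), yet jointly
# they determine y, so every sample is purely synergistic: log 2 ≈ 0.693.
print(sample_interaction(0, 1, 1))
```

Averaging this pointwise quantity over the joint distribution recovers the (distribution-level) interaction information; the sample-level values are what distribution-level estimators discard.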
Open Source Code | Yes | The code is available at https://github.com/GeWu-Lab/LSMI_Estimator.
Open Datasets | Yes | We conduct experiments on extensive multimodal datasets encompassing various tasks and modalities. These include Food-101 (Bossard et al., 2014), which focuses on food classification using text and image modalities; CREMA-D (Cao et al., 2014), dedicated to emotion analysis with audio and visual modalities; Kinetics-Sounds (KS) (Arandjelovic & Zisserman, 2017), an action recognition task employing audio and visual modalities; UCF101 (Soomro et al., 2012), a multimodal action recognition dataset utilizing RGB and optical-flow modalities; CMU-MOSEI (Zadeh et al., 2018), which addresses binary sentiment analysis through video (including audio and visual) and text modalities; and UR-FUNNY (Hasan et al., 2019), aimed at humor detection using video and text.
Dataset Splits | No | The paper uses well-known public datasets (Food-101, CREMA-D, Kinetics-Sounds, UCF101, CMU-MOSEI, UR-FUNNY), which typically have predefined splits, but it does not explicitly state the training/validation/test splits, percentages, or sample counts used in the primary experiments. The 'Targeted Data Partition' section describes splitting samples into 'high-redundancy (High) and low-redundancy (Low) subsets' for one application, but this is a partition by interaction score, not a general train/validation/test specification for all experiments.
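The redundancy-based partition described above amounts to thresholding per-sample redundancy scores. A minimal sketch follows; the function name, the quantile threshold, and the scores are assumptions for illustration, not the paper's procedure:

```python
import numpy as np

def partition_by_redundancy(redundancy, quantile=0.5):
    """Split sample indices into high- and low-redundancy subsets by
    thresholding at a quantile of the per-sample redundancy scores.
    (Illustrative: the paper's actual split criterion may differ.)"""
    redundancy = np.asarray(redundancy, dtype=float)
    thresh = np.quantile(redundancy, quantile)
    high = np.flatnonzero(redundancy >= thresh)  # redundancy-dominated samples
    low = np.flatnonzero(redundancy < thresh)    # uniqueness/synergy-dominated
    return high, low

# Hypothetical per-sample redundancy scores for six samples.
scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.6])
high, low = partition_by_redundancy(scores)
print(high, low)  # indices at/above the median vs. below it
```

Such a partition lets downstream applications (e.g. targeted distillation or ensembling) treat redundancy-dominated and synergy-dominated samples differently.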
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'KNIFE (Pichler et al., 2022) as the differential entropy estimation' and the 'ImageBind model (Girdhar et al., 2023)', but does not provide specific version numbers for these tools or for any other software dependencies such as programming languages or deep learning frameworks.
Experiment Setup | No | The paper describes the overall methodology and algorithm but does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings for training the models.