Efficient Quantification of Multimodal Interaction at Sample Level

Authors: Zequn Yang, Hongfa Wang, Di Hu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling.
Researcher Affiliation | Collaboration | ¹Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; ²Tencent Data Platform; ³Tsinghua University, Beijing, China; ⁴Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China; ⁵Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, China. Correspondence to: Di Hu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Lightweight Sample-wise Multimodal Interaction Estimation (LSMI) Algorithm
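The report only names the algorithm and does not reproduce it. As an illustrative toy sketch (not the authors' LSMI estimator), the kind of sample-wise interaction quantity such a method targets can be shown with pointwise interaction information on a discrete XOR distribution, where the label is purely synergistic with respect to the two modalities:

```python
import math
from itertools import product

# Toy joint distribution p(x1, x2, y) over binary variables.
# XOR structure (y = x1 XOR x2) makes the label purely synergistic.
p = {}
for x1, x2 in product([0, 1], repeat=2):
    p[(x1, x2, x1 ^ x2)] = 0.25

def marginal(p, keep):
    """Marginalize the joint distribution onto the axes listed in `keep`."""
    m = {}
    for k, v in p.items():
        key = tuple(k[i] for i in keep)
        m[key] = m.get(key, 0.0) + v
    return m

p_x1y = marginal(p, (0, 2))
p_x2y = marginal(p, (1, 2))
p_x12 = marginal(p, (0, 1))
p_x1, p_x2, p_y = marginal(p, (0,)), marginal(p, (1,)), marginal(p, (2,))

def pmi(pxy, px, py, x, y):
    """Pointwise mutual information: log p(x, y) / (p(x) p(y))."""
    return math.log(pxy[(x, y)] / (px[(x,)] * py[(y,)]))

def sample_interaction(x1, x2, y):
    """Pointwise interaction i(x1, x2; y) - i(x1; y) - i(x2; y).
    Positive at a sample indicates synergy; negative indicates redundancy."""
    i_joint = math.log(p[(x1, x2, y)] / (p_x12[(x1, x2)] * p_y[(y,)]))
    return i_joint - pmi(p_x1y, p_x1, p_y, x1, y) - pmi(p_x2y, p_x2, p_y, x2, y)

# Each modality alone is independent of y (pmi terms vanish), yet jointly
# they determine y, so every sample is purely synergistic: log 2 ≈ 0.693.
print(sample_interaction(0, 1, 1))
```

Averaging this pointwise quantity over the joint distribution recovers the (distribution-level) interaction information; the sample-level values are what distribution-level estimators discard.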
Open Source Code | Yes | The code is available at https://github.com/GeWu-Lab/LSMI_Estimator.
Open Datasets | Yes | We conduct experiments on extensive multimodal datasets encompassing various tasks and modalities. These include Food-101 (Bossard et al., 2014), which focuses on food classification using text and image modalities; CREMA-D (Cao et al., 2014), dedicated to emotion analysis with audio and visual modalities; Kinetics-Sounds (KS) (Arandjelovic & Zisserman, 2017), an action recognition task employing audio and visual modalities; UCF101 (Soomro et al., 2012), a multimodal action recognition dataset utilizing RGB and optical-flow modalities; CMU-MOSEI (Zadeh et al., 2018), which addresses binary sentiment analysis through video (including audio and visual) and text modalities; and UR-FUNNY (Hasan et al., 2019), aimed at humor detection using video and text.
Dataset Splits | No | The paper uses well-known public datasets (Food-101, CREMA-D, Kinetics-Sounds, UCF101, CMU-MOSEI, UR-FUNNY), which typically have predefined splits, but it does not explicitly state the training/validation/test splits, percentages, or sample counts used in the primary experiments. The 'Targeted Data Partition' section describes splitting samples into 'high-redundancy (High) and low-redundancy (Low) subsets' for one application, but this is a partition by interaction score, not a general train/validation/test specification for all experiments.
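The redundancy-based partition described above amounts to thresholding per-sample redundancy scores. A minimal sketch follows; the function name, the quantile threshold, and the scores are assumptions for illustration, not the paper's procedure:

```python
import numpy as np

def partition_by_redundancy(redundancy, quantile=0.5):
    """Split sample indices into high- and low-redundancy subsets by
    thresholding at a quantile of the per-sample redundancy scores.
    (Illustrative: the paper's actual split criterion may differ.)"""
    redundancy = np.asarray(redundancy, dtype=float)
    thresh = np.quantile(redundancy, quantile)
    high = np.flatnonzero(redundancy >= thresh)  # redundancy-dominated samples
    low = np.flatnonzero(redundancy < thresh)    # uniqueness/synergy-dominated
    return high, low

# Hypothetical per-sample redundancy scores for six samples.
scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.6])
high, low = partition_by_redundancy(scores)
print(high, low)  # indices at/above the median vs. below it
```

Such a partition lets downstream applications (e.g. targeted distillation or ensembling) treat redundancy-dominated and synergy-dominated samples differently.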
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'KNIFE (Pichler et al., 2022) as the differential entropy estimation' and the 'ImageBind model (Girdhar et al., 2023)', but does not provide specific version numbers for these tools or for any other software dependencies such as programming languages or deep learning frameworks.
Experiment Setup | No | The paper describes the overall methodology and algorithm but does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings for training the models.