Towards Equilibrium: An Instantaneous Probe-and-Rebalance Multimodal Learning Approach

Authors: Yang Yang, Xixian Wu, Qing-Yuan Jiang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments reveal that our proposed IPRM outperforms all baselines, achieving state-of-the-art (SOTA) performance on numerous widely used datasets. The code is available at https://github.com/njustkmg/IJCAI25-IPRM.
Researcher Affiliation | Academia | Nanjing University of Science and Technology
Pseudocode | Yes | Algorithm 1: The IPRM learning algorithm.
Open Source Code | Yes | The code is available at https://github.com/njustkmg/IJCAI25-IPRM.
Open Datasets | Yes | We utilize five datasets for experiments, i.e., the CREMA-D [Cao et al., 2014], KSounds [Arandjelovic and Zisserman, 2017], NVGesture [Molchanov et al., 2016], IEMOCAP [Busso et al., 2008], and Sarcasm [Cai et al., 2019] datasets.
Dataset Splits | Yes | The CREMA-D dataset contains 7,442 clips from 91 actors, divided into a training set of 6,698 samples and a testing set of 744 samples. The KSounds dataset is divided into a training set of 15K samples, a validation set of 1.9K samples, and a testing set of 1.9K samples. The NVGesture dataset is split into 1,050 data points for training and 482 for testing. The IEMOCAP dataset is split into a training set of 3,318 samples and a testing set of 1,107 samples. The Sarcasm dataset consists of 24,635 samples, split into a training set of 19,816 samples, a testing set of 2,409 samples, and a validation set of 2,410 samples.
Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce RTX 4090 card.
Software Dependencies | No | The paper mentions models such as ResNet18, I3D, M3AE, CAVMAE, BERT, and CLIP but does not provide specific version numbers for the software libraries or frameworks (e.g., PyTorch, TensorFlow, Python) used for implementation.
Experiment Setup | Yes | The optimization algorithm for the audio-video and trimodal datasets is stochastic gradient descent (SGD), while Adam is employed for the image-text dataset. The learning rate is set to 10^-2 for the audio-video datasets and NVGesture, 10^-3 for IEMOCAP, and 10^-4 for Sarcasm, respectively; it is then reduced by a factor of 10 when the loss saturates. The batch size is set to 64 for CREMA-D, KSounds, and Sarcasm, while it is set to 2 and 16 for NVGesture and IEMOCAP, respectively, due to out-of-memory issues. Furthermore, the hyper-parameter α is set to 0.8 for the audio-video datasets and 0.7 for the trimodal and image-text datasets, based on a cross-validation strategy.
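The Sarcasm split sizes reported under Dataset Splits can be sanity-checked with simple arithmetic: the train, test, and validation counts should sum to the stated total of 24,635 samples. A minimal check:

```python
# Sarcasm dataset split sizes as quoted in the reproducibility report.
train, test, val = 19_816, 2_409, 2_410
total = 24_635

# The three splits account for every sample in the dataset.
assert train + test + val == total
```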
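The learning-rate schedule described in the Experiment Setup row (reduce by a factor of 10 when the loss saturates) is the familiar reduce-on-plateau pattern, e.g. PyTorch's `ReduceLROnPlateau`. A minimal pure-Python sketch of that pattern; the class name, `patience`, and `min_delta` values here are illustrative assumptions, not taken from the paper:

```python
class PlateauLRDecay:
    """Divide the learning rate by 10 once the loss stops improving.

    "Saturation" is approximated as `patience` consecutive epochs with no
    improvement larger than `min_delta` over the best loss seen so far.
    """

    def __init__(self, lr, factor=0.1, patience=3, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0  # epochs without meaningful improvement

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor  # decay by the chosen factor (10x here)
                self.stale = 0
        return self.lr


# Example: start at 1e-2 (the audio-video setting) and hit a plateau.
sched = PlateauLRDecay(lr=1e-2)
for loss in [1.0, 0.8, 0.79, 0.79, 0.79, 0.79]:
    lr = sched.step(loss)
# After three stale epochs, lr has dropped from 1e-2 to 1e-3.
```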