Towards Equilibrium: An Instantaneous Probe-and-Rebalance Multimodal Learning Approach
Authors: Yang Yang, Xixian Wu, Qing-Yuan Jiang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments reveal that our proposed IPRM outperforms all baselines, achieving state-of-the-art (SOTA) performance on numerous widely used datasets. The code is available at https://github.com/njustkmg/IJCAI25-IPRM. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology |
| Pseudocode | Yes | Algorithm 1: The IPRM learning algorithm. |
| Open Source Code | Yes | The code is available at https://github.com/njustkmg/IJCAI25-IPRM. |
| Open Datasets | Yes | We utilize five datasets for experiments, i.e., CREMA-D [Cao et al., 2014], KSounds [Arandjelovic and Zisserman, 2017], NVGesture [Molchanov et al., 2016], IEMOCAP [Busso et al., 2008], and Sarcasm [Cai et al., 2019] datasets. |
| Dataset Splits | Yes | The CREMA-D dataset contains 7,442 clips from 91 actors; it is divided into a training set with 6,698 samples and a testing set with 744 samples. The KSounds dataset is divided into a training set with 15K samples, a validation set with 1.9K samples, and a testing set with 1.9K samples. The NVGesture dataset is split into 1,050 data points for training and 482 for testing. The IEMOCAP dataset is split into a training set with 3,318 samples and a testing set with 1,107 samples. The Sarcasm dataset consists of 24,635 samples and is split into a training set with 19,816 samples, a testing set with 2,409 samples, and a validation set with 2,410 samples. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce RTX 4090 card. |
| Software Dependencies | No | The paper mentions models like ResNet18, I3D, M3AE, CAVMAE, BERT, and CLIP but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, TensorFlow, Python) used for implementation. |
| Experiment Setup | Yes | The optimization algorithm for the audio-video and trimodal datasets is stochastic gradient descent (SGD), while Adam is employed for the image-text dataset. The learning rate is set to 10⁻² for the audio-video datasets and NVGesture, 10⁻³ for IEMOCAP, and 10⁻⁴ for Sarcasm, respectively. It is then reduced by a factor of 10 when the loss saturates. The batch size is set to 64 for CREMA-D, KSounds, and Sarcasm, while it is set to 2 and 16 for NVGesture and IEMOCAP, respectively, due to out-of-memory issues. Furthermore, the hyper-parameter α is set to 0.8 for the audio-video datasets and 0.7 for the trimodal and image-text datasets, based on a cross-validation strategy. |
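The per-dataset settings reported in the Experiment Setup row can be collected into a single configuration table. The sketch below is illustrative only: the `TRAIN_CONFIG` dict and `reduce_on_plateau` helper are our names, not part of the released IPRM code, and the `patience`/`eps` values are assumptions since the paper only states that the learning rate is "reduced by a factor of 10 when the loss saturates".

```python
# Hypothetical summary of the IPRM training settings quoted above.
# Optimizers, learning rates, batch sizes, and alpha follow the table;
# the dict and helper names are illustrative, not from the official repo.
TRAIN_CONFIG = {
    "CREMA-D":   {"optimizer": "SGD",  "lr": 1e-2, "batch_size": 64, "alpha": 0.8},
    "KSounds":   {"optimizer": "SGD",  "lr": 1e-2, "batch_size": 64, "alpha": 0.8},
    "NVGesture": {"optimizer": "SGD",  "lr": 1e-2, "batch_size": 2,  "alpha": 0.7},
    "IEMOCAP":   {"optimizer": "SGD",  "lr": 1e-3, "batch_size": 16, "alpha": 0.7},
    "Sarcasm":   {"optimizer": "Adam", "lr": 1e-4, "batch_size": 64, "alpha": 0.7},
}

def reduce_on_plateau(lr, loss_history, patience=3, factor=0.1, eps=1e-4):
    """Divide the learning rate by 10 once the loss stops improving.

    A minimal stand-in for the paper's schedule; patience and eps are
    assumed values, not reported in the paper.
    """
    recent = loss_history[-patience:]
    # The loss "saturates" when the last `patience` values barely move.
    if len(recent) == patience and max(recent) - min(recent) < eps:
        return lr * factor
    return lr
```

For example, `reduce_on_plateau(1e-2, [0.52, 0.52, 0.52])` drops the rate to 10⁻³, while a still-decreasing loss history leaves it unchanged.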