Improving Multimodal Learning Balance and Sufficiency through Data Remixing
Authors: Xiaoyu Ma, Hao Chen, Yongjian Deng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50% on CREMAD and 3.41% on Kinetic-Sounds, without training set expansion or additional computational overhead during inference. |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, Southeast University, Nanjing, China 2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China 3College of Computer Science, Beijing University of Technology, Beijing, China. Correspondence to: Hao Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Method of Data Remixing |
| Open Source Code | No | The source code is available at Data Remixing. |
| Open Datasets | Yes | CREMA-D (Cao et al., 2014) is an audiovisual dataset for emotion recognition... Kinetic-Sounds (Arandjelovic & Zisserman, 2017) is a dataset derived from the Kinetics dataset... |
| Dataset Splits | Yes | The entire dataset is randomly divided into a training and validation set of 6,698 samples and a test set of 744 samples, with a ratio of approximately 9:1. The dataset comprises 19k 10-second video clips, split into 15k for training, 1.9k for validation, and 1.9k for testing. |
| Hardware Specification | Yes | All reported results are averages from three random seeds, with all models trained on two NVIDIA RTX 3090 GPUs using a batch size of 64. |
| Software Dependencies | No | During training, we use the Adam (Kingma, 2014) optimizer with β = (0.9, 0.999) and set the learning rate to 5e-5. |
| Experiment Setup | Yes | During training, we use the Adam (Kingma, 2014) optimizer with β = (0.9, 0.999) and set the learning rate to 5e-5. All reported results are averages from three random seeds, with all models trained on two NVIDIA RTX 3090 GPUs using a batch size of 64. ... Note that our method is applied after a 10-epoch warm-up stage. |
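The optimizer settings quoted above (Adam with β = (0.9, 0.999) and a learning rate of 5e-5) can be made concrete with a minimal sketch. The paper presumably uses a framework implementation such as `torch.optim.Adam`; the pure-Python single-parameter step below is only an illustration of those reported hyperparameters, following the Adam update rule of Kingma (2014).

```python
def adam_step(theta, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (1-indexed).

    Hyperparameters default to the values reported in the paper:
    lr = 5e-5, betas = (0.9, 0.999). eps is the usual Adam default;
    the paper does not state it, so it is an assumption here.
    """
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# Example: a single step from theta = 1.0 with gradient 2.0. After bias
# correction the first step moves theta by almost exactly lr (= 5e-5).
theta, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

Note that this sketch covers only the optimizer; the remaining setup details (batch size 64, two RTX 3090 GPUs, three random seeds, 10-epoch warm-up before applying the method) are training-loop configuration rather than optimizer state.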