Improving Multimodal Learning Balance and Sufficiency through Data Remixing

Authors: Xiaoyu Ma, Hao Chen, Yongjian Deng

ICML 2025

Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50% on CREMA-D and 3.41% on Kinetics-Sounds, without training set expansion or additional computational overhead during inference.
Researcher Affiliation | Academia | School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; College of Computer Science, Beijing University of Technology, Beijing, China. Correspondence to: Hao Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1: Method of Data Remixing
Open Source Code | No | The source code is available at Data Remixing.
Open Datasets | Yes | CREMA-D (Cao et al., 2014) is an audiovisual dataset for emotion recognition... Kinetics-Sounds (Arandjelovic & Zisserman, 2017) is a dataset derived from the Kinetics dataset...
Dataset Splits | Yes | For CREMA-D, the entire dataset is randomly divided into a training and validation set of 6,698 samples and a test set of 744 samples, a ratio of approximately 9:1. For Kinetics-Sounds, the dataset comprises 19k 10-second video clips, split into 15k for training, 1.9k for validation, and 1.9k for testing.
Hardware Specification | Yes | All reported results are averages from three random seeds, with all models trained on two NVIDIA RTX 3090 GPUs using a batch size of 64.
Software Dependencies | No | During training, we use the Adam (Kingma, 2014) optimizer with β = (0.9, 0.999) and set the learning rate to 5e-5. (No library names or versions are specified.)
Experiment Setup | Yes | During training, we use the Adam (Kingma, 2014) optimizer with β = (0.9, 0.999) and set the learning rate to 5e-5. All reported results are averages from three random seeds, with all models trained on two NVIDIA RTX 3090 GPUs using a batch size of 64. ... Note that our method is applied after a 10-epoch warm-up stage.
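The split and training hyperparameters quoted above can be sketched in plain Python. Note this is a minimal sketch: `split_dataset` is a hypothetical helper (the paper's actual split code is not reproduced here), and only the numeric values are taken from the quoted evidence.

```python
import random

# Training hyperparameters quoted in the reproducibility evidence above.
CONFIG = {
    "optimizer": "Adam",       # Kingma, 2014
    "betas": (0.9, 0.999),
    "learning_rate": 5e-5,
    "batch_size": 64,
    "num_seeds": 3,            # results averaged over three random seeds
    "warmup_epochs": 10,       # Data Remixing applied after a 10-epoch warm-up
}

def split_dataset(n_samples, test_ratio=0.1, seed=0):
    """Randomly split sample indices ~9:1 into train/val and test sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]

# CREMA-D has 7,442 clips in total: 6,698 train/val + 744 test, roughly 9:1.
trainval, test = split_dataset(7442)
print(len(trainval), len(test))  # 6698 744
```

With `test_ratio=0.1` this reproduces the reported CREMA-D counts exactly (744 test, 6,698 train/val); the actual per-sample assignment depends on the authors' unpublished seed and shuffling procedure.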