Balance-Aware Sequence Sampling Makes Multi-Modal Learning Better
Authors: Zhi-Hao Guan, Qing-Yuan Jiang, Yang Yang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS. 1 Introduction ... To support our viewpoint, we conduct a toy experiment on the Twitter2015 dataset to investigate the relationship between different training sequences and MML performance. ... Extensive experiments demonstrate that our proposed method outperforms existing baselines and achieves SOTA performance across widely used datasets. 4 Experiments |
| Researcher Affiliation | Academia | Zhi-Hao Guan, Qing-Yuan Jiang, Yang Yang, Nanjing University of Science and Technology |
| Pseudocode | Yes | Algorithm 1: Multi-modal Learning with Balance-aware Sequence Sampling (BSS). |
| Open Source Code | Yes | Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS. |
| Open Datasets | Yes | Datasets: We validate our proposed method on six widely used datasets, including CREMA-D [Cao et al., 2014], Kinetics-Sounds [Arandjelovic and Zisserman, 2017], VGGSound [Chen et al., 2020], Twitter2015 [Yu and Jiang, 2019], Sarcasm [Cai et al., 2019], and NVGesture [Molchanov et al., 2016]. |
| Dataset Splits | Yes | CREMA-D includes 7,442 video clips across six emotional categories, with 6,698 clips for training and 744 for testing. Kinetics-Sounds is categorized into 31 distinct actions, split into 15,000 for training, 1,900 for validation, and 1,900 for testing. VGGSound provides 168,618 videos for training and validation, along with 13,954 videos for testing. Moreover, Twitter2015 comprises 5,338 text-image pairs, divided into 3,179 for training, 1,122 for validation, and 1,037 for testing. Sarcasm contains 24,635 text-image pairs, allocated as 19,816 for training, 2,410 for validation, and 2,409 for testing. Lastly, NVGesture features three modalities, i.e., RGB, optical flow (OF), and Depth, with 1,050 samples for training and 482 samples for testing. |
| Hardware Specification | Yes | All models are trained on an NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using specific backbones (ResNet18, ResNet50, BERT, I3D) and optimizers (SGD, Adam) but does not provide specific version numbers for any software libraries or programming environments. |
| Experiment Setup | Yes | The optimizer for audio-video datasets is stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 10^-4. The initial learning rate is set to 10^-2 and is reduced by a factor of 10 when the loss saturates. The batch size is set to 64 for CREMA-D and Kinetics-Sounds, 16 for VGGSound, and 2 for NVGesture. For text-image datasets, we employ the Adam optimizer starting with a learning rate of 10^-5, with a batch size of 64. Furthermore, the hyperparameters α and β are set to 0.2 and 0.6, respectively. For the training scheduler, the curriculum period T_grow and the initial proportion λ_0 are configured as 40 and 0.1 under the heuristic setting, while the epoch interval E is configured as 5 under the learning-based setting. |
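The learning-rate schedule quoted above (start at 10^-2, divide by 10 when the loss saturates) can be sketched in plain Python as a patience-based plateau check. This is a minimal illustration, not the authors' code: the `patience` and `eps` thresholds are assumptions, since the paper only states the initial rate and the reduction factor.

```python
class PlateauLRScheduler:
    """Reduce the learning rate by a fixed factor when the loss stops improving.

    Mirrors the schedule described in the paper's setup: initial LR 10^-2,
    divided by 10 on saturation. `patience` and `eps` are illustrative
    assumptions, not values from the paper.
    """

    def __init__(self, lr=1e-2, factor=0.1, patience=3, eps=1e-4):
        self.lr = lr              # current learning rate
        self.factor = factor      # multiplicative reduction (1/10 per the paper)
        self.patience = patience  # epochs without improvement before reducing
        self.eps = eps            # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return the (possibly reduced) LR."""
        if loss < self.best - self.eps:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# A loss curve that improves, then saturates: the LR drops once by 10x.
sched = PlateauLRScheduler()
losses = [1.0, 0.8, 0.79995, 0.79995, 0.79995, 0.79995, 0.79995]
lrs = [sched.step(l) for l in losses]
```

In a PyTorch training loop, the same behavior is typically obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)`, stepped with the validation loss each epoch.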