Balance-Aware Sequence Sampling Makes Multi-Modal Learning Better
Authors: Zhi-Hao Guan, Qing-Yuan Jiang, Yang Yang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS. 1 Introduction ... To support our viewpoint, we conduct a toy experiment on the Twitter2015 dataset to investigate the relationship between different training sequences and MML performance. ... Extensive experiments demonstrate that our proposed method outperforms existing baselines and achieves SOTA performance across widely used datasets. 4 Experiments |
| Researcher Affiliation | Academia | Zhi-Hao Guan, Qing-Yuan Jiang, Yang Yang, Nanjing University of Science and Technology |
| Pseudocode | Yes | Algorithm 1: Multi-modal Learning with Balance-aware Sequence Sampling (BSS). |
| Open Source Code | Yes | Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS. |
| Open Datasets | Yes | Datasets: We validate our proposed method on six widely used datasets, including CREMA-D [Cao et al., 2014], Kinetics-Sounds [Arandjelovic and Zisserman, 2017], VGGSound [Chen et al., 2020], Twitter2015 [Yu and Jiang, 2019], Sarcasm [Cai et al., 2019], and NVGesture [Molchanov et al., 2016]. |
| Dataset Splits | Yes | CREMA-D includes 7,442 video clips across six emotional categories, with 6,698 clips for training and 744 for testing. Kinetics-Sounds is categorized into 31 distinct actions, split into 15,000 for training, 1,900 for validation, and 1,900 for testing. VGGSound provides 168,618 videos for training and validation, along with 13,954 videos for testing. Moreover, Twitter2015 comprises 5,338 text-image pairs, divided into 3,179 for training, 1,122 for validation, and 1,037 for testing. Sarcasm contains 24,635 text-image pairs, allocated as 19,816 for training, 2,410 for validation, and 2,409 for testing. Lastly, NVGesture features three modalities, i.e., RGB, optical flow (OF), and Depth, with 1,050 samples for training and 482 samples for testing. |
| Hardware Specification | Yes | All models are trained on an NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using specific backbones (ResNet18, ResNet50, BERT, I3D) and optimizers (SGD, Adam) but does not provide specific version numbers for any software libraries or programming environments. |
| Experiment Setup | Yes | The optimizer for audio-video datasets is stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 10^-4. The initial learning rate is set to 10^-2 and is reduced by a factor of 10 when the loss saturates. The batch size is set to 64 for CREMA-D and Kinetics-Sounds, 16 for VGGSound, and 2 for NVGesture. For text-image datasets, we employ the Adam optimizer starting with a learning rate of 10^-5, with a batch size of 64. Furthermore, the hyperparameters α and β are set to 0.2 and 0.6, respectively. For the training scheduler, the curriculum period T_grow and the initial proportion λ_0 are configured as 40 and 0.1 under the heuristic setting, while the epoch interval E is configured as 5 under the learning-based setting. |
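The learning-rate schedule quoted above (start at 10^-2, divide by 10 when the loss saturates) can be sketched in plain Python as a patience-based plateau check. This is a minimal illustration, not the authors' code: the `patience` and `eps` thresholds are assumptions, since the paper only states the initial rate and the reduction factor.

```python
class PlateauLRScheduler:
    """Reduce the learning rate by a fixed factor when the loss stops improving.

    Mirrors the schedule described in the paper's setup: initial LR 10^-2,
    divided by 10 on saturation. `patience` and `eps` are illustrative
    assumptions, not values from the paper.
    """

    def __init__(self, lr=1e-2, factor=0.1, patience=3, eps=1e-4):
        self.lr = lr              # current learning rate
        self.factor = factor      # multiplicative reduction (1/10 per the paper)
        self.patience = patience  # epochs without improvement before reducing
        self.eps = eps            # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return the (possibly reduced) LR."""
        if loss < self.best - self.eps:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# A loss curve that improves, then saturates: the LR drops once by 10x.
sched = PlateauLRScheduler()
losses = [1.0, 0.8, 0.79995, 0.79995, 0.79995, 0.79995, 0.79995]
lrs = [sched.step(l) for l in losses]
```

In a PyTorch training loop, the same behavior is typically obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)`, stepped with the validation loss each epoch.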