Interactive Multimodal Learning via Flat Gradient Modification
Authors: Qing-Yuan Jiang, Zhouyang Chi, Yang Yang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on widely used datasets demonstrate that IGM outperforms various state-of-the-art (SOTA) baselines, achieving superior performance. (Section 4: Experiments) |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 Algorithm for IGM |
| Open Source Code | Yes | The source code is available at https://github.com/njustkmg/IJCAI25-IGM. |
| Open Datasets | Yes | We adopt five datasets, i.e., CREMA-D [Cao et al., 2014], Kinetics-Sounds [Arandjelovic and Zisserman, 2017], Twitter2015 [Yu and Jiang, 2019], Sarcasm [Cai et al., 2019], and NVGesture [Molchanov et al., 2016], for evaluation. |
| Dataset Splits | Yes | CREMA-D consists of 7,442 clips from 91 actors. The clips are divided into 6,698 samples for training and 744 samples for testing. Kinetics-Sounds comprises 31 human action category labels. It is divided into a training set with 15K samples, a validation set with 1.9K samples, and a testing set with 1.9K samples. Twitter2015 contains 5,338 image-text pairs with 3,179 for training, 1,122 for validation, and 1,037 for testing. Sarcasm consists of 24,635 image-text pairs. We split this dataset as 19,816 for training, 2,410 for validation, and 2,409 for testing following the setting of the original paper. NVGesture dataset contains 1,532 dynamic hand gestures. This dataset is divided into 1,050 for training and 482 for testing. |
| Hardware Specification | Yes | The experiments are performed with an NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using ResNet18, BERT, ResNet50, and I3D as backbones/encoders, and SGD/Adam as optimizers, but does not provide specific version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For IGM, we explore a three-layer network, which can be denoted as FC(Dim→256) → ReLU → FC(256→64) → FC(64→c), as the classification head... For audio and video modalities, the dimension of the feature is 512. For image-text modalities and the NVGesture dataset, the dimension is 1024. The gradient modification strategy is applied to the classification head for IGM. Furthermore, for IGM, we use SGD as the optimizer for the audio-video and NVGesture datasets, with a momentum of 0.9 and weight decay of 1×10⁻⁴. The initial learning rate is set to 1×10⁻², and is divided by 10 when the loss is saturated. For image-text datasets [Yu and Jiang, 2019; Cai et al., 2019], we use Adam as the optimizer, with an initial learning rate of 1×10⁻⁵. By using the cross-validation strategy with a validation set, the hyper-parameter scaling factor τ is set to 0.4 for all datasets. The hyper-parameter ρ is set to 1×10⁻¹⁵ and 1×10⁻¹⁰ for the image/text modality and audio modality, respectively. When calculating cumulative variance, we set the batch size to 12 for all datasets except NVGesture. For the NVGesture dataset, the batch size is set to 6 due to memory limitations. |
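The classification head and optimizer settings quoted above can be sketched in PyTorch as follows. This is a minimal illustration assuming a standard `nn.Sequential` head; the function name `build_classification_head` and the choice of 6 output classes (the six emotion categories of CREMA-D) are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

def build_classification_head(dim: int, num_classes: int) -> nn.Sequential:
    """Three-layer head FC(dim->256) -> ReLU -> FC(256->64) -> FC(64->c),
    matching the structure described in the experiment setup."""
    return nn.Sequential(
        nn.Linear(dim, 256),
        nn.ReLU(),
        nn.Linear(256, 64),
        nn.Linear(64, num_classes),
    )

# Audio/video features are 512-dim; image-text and NVGesture features are 1024-dim.
head = build_classification_head(dim=512, num_classes=6)

# SGD settings reported for the audio-video and NVGesture datasets:
# lr 1e-2, momentum 0.9, weight decay 1e-4.
optimizer = torch.optim.SGD(
    head.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4
)

# Batch size 12, as used when calculating cumulative variance.
features = torch.randn(12, 512)
logits = head(features)
print(tuple(logits.shape))
```

Note that the learning-rate schedule ("divided by 10 when the loss is saturated") would be applied on top of this, e.g. via `torch.optim.lr_scheduler.ReduceLROnPlateau`.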