Interactive Multimodal Learning via Flat Gradient Modification
Authors: Qing-Yuan Jiang, Zhouyang Chi, Yang Yang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on widely used datasets demonstrate that IGM outperforms various state-of-the-art (SOTA) baselines, achieving superior performance. (Section 4: Experiments) |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 Algorithm for IGM |
| Open Source Code | Yes | The source code is available at https://github.com/njustkmg/IJCAI25-IGM. |
| Open Datasets | Yes | We adopt five datasets, i.e., CREMA-D [Cao et al., 2014], Kinetics-Sounds [Arandjelovic and Zisserman, 2017], Twitter2015 [Yu and Jiang, 2019], Sarcasm [Cai et al., 2019], and NVGesture [Molchanov et al., 2016], for evaluation. |
| Dataset Splits | Yes | CREMA-D consists of 7,442 clips from 91 actors. The clips are divided into 6,698 samples for training and 744 samples for testing. Kinetics-Sounds comprises 31 human action category labels. It is divided into a training set with 15K samples, a validation set with 1.9K samples, and a testing set with 1.9K samples. Twitter2015 contains 5,338 image-text pairs with 3,179 for training, 1,122 for validation, and 1,037 for testing. Sarcasm consists of 24,635 image-text pairs. We split this dataset as 19,816 for training, 2,410 for validation, and 2,409 for testing following the setting of the original paper. NVGesture dataset contains 1,532 dynamic hand gestures. This dataset is divided into 1,050 for training and 482 for testing. |
| Hardware Specification | Yes | The experiments are performed with an NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using ResNet18, BERT, ResNet50, and I3D as backbones/encoders, and SGD/Adam as optimizers, but does not provide specific version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For IGM, we explore a three-layer network, which can be denoted as FC(Dim→256) → ReLU → FC(256→64) → FC(64→c), as the classification head... For audio and video modalities, the dimension of the feature is 512. For image-text modalities and the NVGesture dataset, the dimension is 1024. The gradient modification strategy is applied to the classification head for IGM. Furthermore, for IGM, we use SGD as the optimizer for the audio-video and NVGesture datasets, with a momentum of 0.9 and weight decay of 1×10⁻⁴. The initial learning rate is set to 1×10⁻², and is divided by 10 when the loss is saturated. For image-text datasets [Yu and Jiang, 2019; Cai et al., 2019], we use Adam as the optimizer, with an initial learning rate of 1×10⁻⁵. By using the cross-validation strategy with a validation set, the hyper-parameter scaling factor τ is set to 0.4 for all datasets. The hyper-parameter ρ is set to 1×10⁻¹⁵ and 1×10⁻¹⁰ for the image/text modality and audio modality, respectively. When calculating cumulative variance, we set the batch size to 12 for all datasets except NVGesture. For the NVGesture dataset, the batch size is set to 6 due to memory limitations. |
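The classification head and optimizer settings quoted above can be sketched in PyTorch as follows. This is a minimal illustration assuming a standard `nn.Sequential` head; the function name `build_classification_head` and the choice of 6 output classes (the six emotion categories of CREMA-D) are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

def build_classification_head(dim: int, num_classes: int) -> nn.Sequential:
    """Three-layer head FC(dim->256) -> ReLU -> FC(256->64) -> FC(64->c),
    matching the structure described in the experiment setup."""
    return nn.Sequential(
        nn.Linear(dim, 256),
        nn.ReLU(),
        nn.Linear(256, 64),
        nn.Linear(64, num_classes),
    )

# Audio/video features are 512-dim; image-text and NVGesture features are 1024-dim.
head = build_classification_head(dim=512, num_classes=6)

# SGD settings reported for the audio-video and NVGesture datasets:
# lr 1e-2, momentum 0.9, weight decay 1e-4.
optimizer = torch.optim.SGD(
    head.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4
)

# Batch size 12, as used when calculating cumulative variance.
features = torch.randn(12, 512)
logits = head(features)
print(tuple(logits.shape))
```

Note that the learning-rate schedule ("divided by 10 when the loss is saturated") would be applied on top of this, e.g. via `torch.optim.lr_scheduler.ReduceLROnPlateau`.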