Robust Multimodal Learning via Cross-Modal Proxy Tokens

Authors: Md Kaykobad Reza, Ameya Patil, Mashhour Solh, Salman Asif

TMLR 2025

Reproducibility Assessment
Research Type: Experimental. "Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning."
Researcher Affiliation: Collaboration. Md Kaykobad Reza (University of California Riverside, EMAIL); Ameya Patil (Amazon, EMAIL); Mashhour Solh (Amazon, EMAIL); M. Salman Asif (University of California Riverside, EMAIL).
Pseudocode: No. The paper describes the proposed approach using textual descriptions and mathematical formulations in Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens.
Open Datasets: Yes. "We evaluate our approach on five popular multimodal datasets across different tasks. A brief description of each dataset is provided here, with further details in Section A2 in the appendix. UPMC Food-101 (Wang et al., 2015) is a multimodal classification dataset... MM-IMDb (Arevalo et al., 2017) is a widely used multimodal dataset... Kinetics-Sound (KS) (Arandjelovic & Zisserman, 2017) is a subset of the Kinetics-400 dataset... Audio-Visual Event (AVE) dataset (Tian et al., 2018) is a benchmark dataset... CREMA-D dataset (Cao et al., 2014) is used for multimodal emotion recognition."
Dataset Splits: Yes. "UPMC Food-101... divided into a training set of 67,988 pairs and a test set of 22,716 pairs. MM-IMDb... split into 15,552 training, 2,608 validation, and 7,799 test samples... Kinetics-Sound (KS)... a training set containing 14,739 samples and a test set containing 2,594 samples. Audio-Visual Event (AVE)... divided into train/val/test sets containing 3,339/402/402 samples... CREMA-D dataset... contains 6,698 training and 744 test samples."
Hardware Specification: Yes. "All the models are trained using two NVIDIA RTX 2080Ti GPUs."
Software Dependencies: Yes. "We use Python 3.8.19 and PyTorch 2.2.2 for training and evaluating our models."
Experiment Setup: Yes. "For vision-language datasets, we set the learning rate to 1e-3 and train the models for 10 epochs with a batch size of 8. For audio-video datasets, the learning rate is set to 5e-5 and models are trained for 100 epochs with a batch size of 4. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with epsilon = 1e-8 and weight decay = 0.02. During training, we use cross-entropy loss and a polynomial learning-rate scheduler with power = 0.9, treating the first 5 epochs as warm-up. We set LoRA (Hu et al., 2022) rank = 1 and insert the adapters after the query, key, value, and output layers of each transformer block. Further details with all the hyperparameters can be found in Section A3 in the appendix."
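The reported setup (AdamW with eps = 1e-8 and weight decay 0.02, polynomial LR decay with power 0.9 after a 5-epoch warm-up, and rank-1 LoRA adapters on frozen linear layers) can be sketched in PyTorch. This is a minimal illustration only: the model, feature dimensions (768, 101), warm-up schedule shape, and LoRA placement are assumptions, not the paper's actual architecture or code.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Minimal rank-r LoRA adapter wrapped around a frozen linear layer (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 1, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.02)  # B stays zero, so the adapter is a no-op at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Placeholder model: one LoRA-adapted layer plus a trainable classifier head.
model = nn.Sequential(LoRALinear(nn.Linear(768, 768), rank=1), nn.Linear(768, 101))

# AdamW over trainable parameters only, with the reported hyperparameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, eps=1e-8, weight_decay=0.02,
)

# 5-epoch warm-up followed by polynomial decay (power 0.9); the linear
# warm-up shape and start factor are assumptions, not stated in the paper.
epochs, warmup_epochs = 10, 5
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs)
decay = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=epochs - warmup_epochs, power=0.9)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, decay], milestones=[warmup_epochs])

criterion = nn.CrossEntropyLoss()
```

In a training loop, `scheduler.step()` would be called once per epoch after the optimizer updates. Because `B` is initialized to zero, the adapted model reproduces the frozen base layer exactly at the start of training, which is the standard LoRA initialization choice.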