Robust Multimodal Learning via Cross-Modal Proxy Tokens
Authors: Md Kaykobad Reza, Ameya Patil, Mashhour Solh, Salman Asif
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. |
| Researcher Affiliation | Collaboration | Md Kaykobad Reza, University of California Riverside (EMAIL); Ameya Patil, Amazon (EMAIL); Mashhour Solh, Amazon (EMAIL); M. Salman Asif, University of California Riverside (EMAIL) |
| Pseudocode | No | The paper describes the proposed approach using textual descriptions and mathematical formulations in Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens. |
| Open Datasets | Yes | We evaluate our approach on five popular multimodal datasets across different tasks. A brief description of each dataset is provided here, with further details in Section A2 in the appendix. UPMC Food-101 (Wang et al., 2015) is a multimodal classification dataset... MM-IMDb (Arevalo et al., 2017) is a widely used multimodal dataset... Kinetics-Sound (KS) (Arandjelovic & Zisserman, 2017) is a subset of the Kinetics-400 dataset... Audio-Visual Event (AVE) dataset (Tian et al., 2018) is a benchmark dataset... CREMA-D dataset (Cao et al., 2014) is used for multimodal emotion recognition. |
| Dataset Splits | Yes | UPMC Food-101... divided into a training set of 67,988 pairs and a test set of 22,716 pairs. MM-IMDb... split into 15,552 training, 2,608 validation, and 7,799 test samples... Kinetics-Sound (KS)... a training set containing 14,739 samples and a test set containing 2,594 samples. Audio-Visual Event (AVE)... divided into train/val/test sets containing 3,339/402/402 samples... CREMA-D dataset... contains 6,698 training and 744 test samples. |
| Hardware Specification | Yes | All the models are trained using two NVIDIA RTX 2080Ti GPUs. |
| Software Dependencies | Yes | We use Python 3.8.19 and PyTorch 2.2.2 for training and evaluating our models. |
| Experiment Setup | Yes | For vision-language datasets, we set the learning rate to 1e-3 and train the models for 10 epochs with a batch size of 8. For audio-video datasets, the learning rate is set to 5e-5 and models are trained for 100 epochs with a batch size of 4. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with epsilon = 1e-8 and weight decay = 0.02. While training, we use cross-entropy loss and a polynomial learning rate scheduler with power = 0.9, treating the first 5 epochs as warm-up. We set LoRA (Hu et al., 2022) rank = 1 and insert the adapters after the query, key, value, and output layers of each transformer block. Further details with all the hyperparameters can be found in Section A3 in the appendix. |
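The learning-rate recipe in the setup row (polynomial decay with power 0.9 and a 5-epoch warm-up) can be sketched as a standalone schedule function. This is a hedged reconstruction, not code from the authors' repository: the table does not specify the warm-up shape, so linear warm-up is assumed, and `lr_at_epoch` is a hypothetical helper name.

```python
def lr_at_epoch(epoch, base_lr=1e-3, total_epochs=10, warmup_epochs=5, power=0.9):
    """Per-epoch learning rate: linear warm-up (assumed), then polynomial decay.

    Defaults mirror the vision-language setting quoted above (lr 1e-3, 10 epochs,
    5 warm-up epochs, power 0.9); the audio-video setting would use
    base_lr=5e-5 and total_epochs=100 instead.
    """
    if epoch < warmup_epochs:
        # Linear ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Polynomial decay over the remaining epochs: lr = base_lr * (1 - t)^power.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * (1.0 - progress) ** power
```

In a PyTorch training loop this shape is typically realized with `torch.optim.lr_scheduler.LambdaLR` wrapping an `AdamW(params, lr=1e-3, eps=1e-8, weight_decay=0.02)` optimizer, matching the hyperparameters quoted in the row above.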