Robust Multimodal Learning via Cross-Modal Proxy Tokens
Authors: Md Kaykobad Reza, Ameya Patil, Mashhour Solh, Salman Asif
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. |
| Researcher Affiliation | Collaboration | Md Kaykobad Reza, University of California Riverside (EMAIL); Ameya Patil, Amazon (EMAIL); Mashhour Solh, Amazon (EMAIL); M. Salman Asif, University of California Riverside (EMAIL) |
| Pseudocode | No | The paper describes the proposed approach using textual descriptions and mathematical formulations in Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens. |
| Open Datasets | Yes | We evaluate our approach on five popular multimodal datasets across different tasks. A brief description of each dataset is provided here, with further details in Section A2 in the appendix. UPMC Food-101 (Wang et al., 2015) is a multimodal classification dataset... MM-IMDb (Arevalo et al., 2017) is a widely used multimodal dataset... Kinetics-Sound (KS) (Arandjelovic & Zisserman, 2017) is a subset of the Kinetics-400 dataset... Audio-Visual Event (AVE) dataset (Tian et al., 2018) is a benchmark dataset... CREMA-D dataset (Cao et al., 2014) is used for multimodal emotion recognition. |
| Dataset Splits | Yes | UPMC Food-101... divided into a training set of 67,988 pairs and a test set of 22,716 pairs. MM-IMDb... split into 15,552 training, 2,608 validation, and 7,799 test samples... Kinetics-Sound (KS)... a training set containing 14,739 samples and a test set containing 2,594 samples. Audio-Visual Event (AVE)... divided into train/val/test sets containing 3,339/402/402 samples... CREMA-D dataset... contains 6,698 training and 744 test samples. |
| Hardware Specification | Yes | All the models are trained using two NVIDIA RTX 2080Ti GPUs. |
| Software Dependencies | Yes | We use Python 3.8.19 and PyTorch 2.2.2 for training and evaluating our models. |
| Experiment Setup | Yes | For vision-language datasets, we set the learning rate to 1e-3 and train the models for 10 epochs with a batch size of 8. For audio-video datasets, the learning rate is set to 5e-5 and models are trained for 100 epochs with a batch size of 4. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with epsilon = 1e-8 and weight decay = 0.02. While training, we use cross-entropy loss and a polynomial learning rate scheduler with power = 0.9, treating the first 5 epochs as warm-up. We set LoRA (Hu et al., 2022) rank = 1 and insert the adapters after the query, key, value, and output layers of each transformer block. Further details with all the hyperparameters can be found in Section A3 in the appendix. |
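The learning-rate recipe in the setup row (polynomial decay with power 0.9 and a 5-epoch warm-up) can be sketched as a standalone schedule function. This is a hedged reconstruction, not code from the authors' repository: the table does not specify the warm-up shape, so linear warm-up is assumed, and `lr_at_epoch` is a hypothetical helper name.

```python
def lr_at_epoch(epoch, base_lr=1e-3, total_epochs=10, warmup_epochs=5, power=0.9):
    """Per-epoch learning rate: linear warm-up (assumed), then polynomial decay.

    Defaults mirror the vision-language setting quoted above (lr 1e-3, 10 epochs,
    5 warm-up epochs, power 0.9); the audio-video setting would use
    base_lr=5e-5 and total_epochs=100 instead.
    """
    if epoch < warmup_epochs:
        # Linear ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Polynomial decay over the remaining epochs: lr = base_lr * (1 - t)^power.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * (1.0 - progress) ** power
```

In a PyTorch training loop this shape is typically realized with `torch.optim.lr_scheduler.LambdaLR` wrapping an `AdamW(params, lr=1e-3, eps=1e-8, weight_decay=0.02)` optimizer, matching the hyperparameters quoted in the row above.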