Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust

Authors: Zhuo Zhi, Yuxuan Sun, Qiangqiang Wu, Ziquan Liu, Miguel R. D. Rodrigues

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical study reveals that using only the self-attention layer to perform modality fusion makes the model less robust to missing modalities and input noise, as the model will overly rely on one particular modality. To improve the robustness of the transformer, our paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities without using any additional trainable parameters. Our empirical study shows that the implicit modality alignment improves the effectiveness of the multimodal Transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream task datasets, including 2-modality and 3-modality tasks.
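The parameter-free alignment summarized above can be sketched with an entropic (Sinkhorn) approximation of the Wasserstein distance between the token sets of two modalities. This is an illustrative sketch only: the squared-Euclidean cost, uniform marginals, regularization strength, and iteration count are assumptions, not the paper's exact formulation in its equations 5-8.

```python
import numpy as np

def sinkhorn_wasserstein(tokens_a, tokens_b, eps=0.1, n_iters=50):
    """Entropic-regularized Wasserstein cost between two token sets.

    tokens_a: (n, d) tokens from modality A; tokens_b: (m, d) from modality B.
    Uniform marginals and a pairwise squared-Euclidean cost are assumed.
    The Sinkhorn iterations introduce no trainable parameters, matching the
    paper's claim of an implicit, parameter-free alignment objective.
    """
    n, m = tokens_a.shape[0], tokens_b.shape[0]
    # Pairwise squared-Euclidean cost matrix C[i, j] = ||a_i - b_j||^2.
    diff = tokens_a[:, None, :] - tokens_b[None, :, :]
    C = np.sum(diff ** 2, axis=-1)
    K = np.exp(-C / eps)                      # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u = np.ones(n)
    for _ in range(n_iters):                  # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]           # approximate transport plan
    return float(np.sum(P * C))               # transport cost under P
```

The returned cost would be added to the task loss as the alignment penalty; well-aligned token sets yield a cost near zero, while a systematic offset between modalities inflates it.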
Researcher Affiliation Academia Zhuo Zhi EMAIL Department of Electronic and Electrical Engineering University College London Yuxuan Sun EMAIL Department of Electronic and Electrical Engineering University College London Qiangqiang Wu EMAIL Department of Computer Science City University of Hong Kong Ziquan Liu EMAIL School of Electronic Engineering and Computer Science Queen Mary University of London Miguel Rodrigues EMAIL Department of Electronic and Electrical Engineering University College London
Pseudocode No The paper describes mathematical formulations (equations 5-8) for the optimal transport problem and its application, but it does not include a clearly labeled pseudocode or algorithm block detailing the step-by-step procedure in a structured format.
Open Source Code No The paper does not provide any explicit statement about releasing code or a link to a code repository for the methodology described. It only mentions 'Reviewed on OpenReview: https://openreview.net/forum?id=2IkaUZdB62', which is a review forum link.
Open Datasets Yes Hateful Memes Kiela et al. (2020). This is a binary classification task with two modalities, image and text. MM-IMDb Arevalo et al. (2017). This is a multi-label (25 labels) classification task with two modalities, image and text. UR-FUNNY Hasan et al. (2019). This is a binary classification task with three modalities, text, video and audio. MOSEI Zadeh et al. (2018). This is a regression task with three modalities, text, video and audio. MedFuse-I Hayat et al. (2022b). This is a real-world dataset that contains EHR and X-ray data for each patient.
Dataset Splits Yes Hateful Memes Kiela et al. (2020). The numbers of samples in the training/val/test splits are 8500, 500 and 1000. MM-IMDb Arevalo et al. (2017). The numbers of samples in the training/val/test splits are 32278, 5411 and 16120. UR-FUNNY Hasan et al. (2019). The numbers of samples in the training/val/test splits are 8074, 1034 and 1058. MOSEI Zadeh et al. (2018). The numbers of samples in the training/val/test splits are 16265, 1869 and 4643. MedFuse-I Hayat et al. (2022b). The numbers of samples in the training/val/test splits are 18845, 2138 and 5243. We set the noise level to 0.25, 0.5 and 0.75 for the image, text and time-series data. For the missing-modality test, we set 25%, 50% and 75% as the missing proportions and calculate the average performance.
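The robustness protocol above (graded noise levels and missing-modality proportions) can be emulated with two small helpers. The additive-Gaussian noise model and the zero-filling of missing modalities are assumptions for illustration; the paper may use a different corruption scheme per modality.

```python
import numpy as np

def add_noise(x, level, rng):
    """Corrupt features with additive Gaussian noise scaled by `level`
    (0.25 / 0.5 / 0.75 in the reported robustness test). The exact noise
    model is an assumption; e.g. image, text and time-series inputs may
    each be corrupted differently in the paper."""
    return x + level * rng.standard_normal(x.shape)

def drop_modality(batch, modality, proportion, rng):
    """Zero out one modality for `proportion` of the samples in `batch`
    (a dict mapping modality name -> (N, ...) array), emulating the
    25% / 50% / 75% missing-modality evaluation."""
    out = {k: v.copy() for k, v in batch.items()}
    n = out[modality].shape[0]
    missing = rng.random(n) < proportion       # which samples lose the modality
    out[modality][missing] = 0.0
    return out, missing
```

Averaging a metric over the three missing proportions, as the paper reports, would then just loop `drop_modality` over 0.25, 0.5 and 0.75.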
Hardware Specification Yes Experiments are running on Tesla V100 GPUs.
Software Dependencies No The paper does not explicitly state specific software dependencies with version numbers, such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes We set the batch sizes for Hateful Memes, MM-IMDb, UR-FUNNY and MOSEI as 128, 64, 256 and 256. The learning rate search range is [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]. The learning rate strategy is linear decay with warm-up. The search range of α is [0.1, 0.2, 1.0, 5.0]. The search range of β is [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 4.0, 10.0]. Early stopping with patience 5 is applied for selecting the weights. For early fusion, we use a 12-layer, 12-head transformer initialized with the pretrained ViLT-B weights Kim et al. (2021). For late fusion, we employ a 2-layer, 8-head transformer and randomly initialize it.