Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust

Authors: Zhuo Zhi, Yuxuan Sun, Qiangqiang Wu, Ziquan Liu, Miguel R. D. Rodrigues

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical study reveals that using only the self-attention layer to perform modality fusion makes the model less robust to missing modalities and input noise, as the model will overly rely on one particular modality. To improve the robustness of the transformer, our paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities without using any additional trainable parameters. Our empirical study shows that the implicit modality alignment improves the effectiveness of the multimodal Transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream task datasets, including 2-modality and 3-modality tasks.
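The parameter-free alignment summarized above can be sketched with an entropic (Sinkhorn) approximation of the Wasserstein distance between the token sets of two modalities. This is an illustrative sketch only: the squared-Euclidean cost, uniform marginals, regularization strength, and iteration count are assumptions, not the paper's exact formulation in its equations 5-8.

```python
import numpy as np

def sinkhorn_wasserstein(tokens_a, tokens_b, eps=0.1, n_iters=50):
    """Entropic-regularized Wasserstein cost between two token sets.

    tokens_a: (n, d) tokens from modality A; tokens_b: (m, d) from modality B.
    Uniform marginals and a pairwise squared-Euclidean cost are assumed.
    The Sinkhorn iterations introduce no trainable parameters, matching the
    paper's claim of an implicit, parameter-free alignment objective.
    """
    n, m = tokens_a.shape[0], tokens_b.shape[0]
    # Pairwise squared-Euclidean cost matrix C[i, j] = ||a_i - b_j||^2.
    diff = tokens_a[:, None, :] - tokens_b[None, :, :]
    C = np.sum(diff ** 2, axis=-1)
    K = np.exp(-C / eps)                      # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u = np.ones(n)
    for _ in range(n_iters):                  # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]           # approximate transport plan
    return float(np.sum(P * C))               # transport cost under P
```

The returned cost would be added to the task loss as the alignment penalty; well-aligned token sets yield a cost near zero, while a systematic offset between modalities inflates it.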
Researcher Affiliation Academia Zhuo Zhi EMAIL Department of Electronic and Electrical Engineering University College London Yuxuan Sun EMAIL Department of Electronic and Electrical Engineering University College London Qiangqiang Wu EMAIL Department of Computer Science City University of Hong Kong Ziquan Liu EMAIL School of Electronic Engineering and Computer Science Queen Mary University of London Miguel Rodrigues EMAIL Department of Electronic and Electrical Engineering University College London
Pseudocode No The paper describes mathematical formulations (equations 5-8) for the optimal transport problem and its application, but it does not include a clearly labeled pseudocode or algorithm block detailing the step-by-step procedure in a structured format.
Open Source Code No The paper does not provide any explicit statement about releasing code or a link to a code repository for the methodology described. It only mentions 'Reviewed on OpenReview: https://openreview.net/forum?id=2IkaUZdB62', which is a review forum link.
Open Datasets Yes Hateful Memes Kiela et al. (2020). This is a binary classification task with two modalities, image and text. MM-IMDb Arevalo et al. (2017). This is a multi-label (25 labels) classification task with two modalities, image and text. UR-FUNNY Hasan et al. (2019). This is a binary classification task with three modalities, text, video and audio. MOSEI Zadeh et al. (2018). This is a regression task with three modalities, text, video and audio. MedFuse-I Hayat et al. (2022b). This is a real-world dataset that contains EHR and X-ray data for each patient.
Dataset Splits Yes Hateful Memes Kiela et al. (2020). The numbers of samples in the training/val/test splits are 8500, 500 and 1000. MM-IMDb Arevalo et al. (2017). The numbers of samples in the training/val/test splits are 32278, 5411 and 16120. UR-FUNNY Hasan et al. (2019). The numbers of samples in the training/val/test splits are 8074, 1034 and 1058. MOSEI Zadeh et al. (2018). The numbers of samples in the training/val/test splits are 16265, 1869 and 4643. MedFuse-I Hayat et al. (2022b). The numbers of samples in the training/val/test splits are 18845, 2138 and 5243. We set the noise level to 0.25, 0.5 and 0.75 for the image, text and time-series data. For the missing-modality test, we set 25%, 50% and 75% as the missing proportions and calculate the average performance.
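The robustness protocol above (graded noise levels and missing-modality proportions) can be emulated with two small helpers. The additive-Gaussian noise model and the zero-filling of missing modalities are assumptions for illustration; the paper may use a different corruption scheme per modality.

```python
import numpy as np

def add_noise(x, level, rng):
    """Corrupt features with additive Gaussian noise scaled by `level`
    (0.25 / 0.5 / 0.75 in the reported robustness test). The exact noise
    model is an assumption; e.g. image, text and time-series inputs may
    each be corrupted differently in the paper."""
    return x + level * rng.standard_normal(x.shape)

def drop_modality(batch, modality, proportion, rng):
    """Zero out one modality for `proportion` of the samples in `batch`
    (a dict mapping modality name -> (N, ...) array), emulating the
    25% / 50% / 75% missing-modality evaluation."""
    out = {k: v.copy() for k, v in batch.items()}
    n = out[modality].shape[0]
    missing = rng.random(n) < proportion       # which samples lose the modality
    out[modality][missing] = 0.0
    return out, missing
```

Averaging a metric over the three missing proportions, as the paper reports, would then just loop `drop_modality` over 0.25, 0.5 and 0.75.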
Hardware Specification Yes Experiments are running on Tesla V100 GPUs.
Software Dependencies No The paper does not explicitly state specific software dependencies with version numbers, such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes We set the batch sizes for Hateful Memes, MM-IMDb, UR-FUNNY and MOSEI as 128, 64, 256 and 256. The learning rate search range is [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]. The learning rate strategy is linear decay with warm-up. The search range of α is [0.1, 0.2, 1.0, 5.0]. The search range of β is [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 4.0, 10.0]. Early stopping with patience 5 is applied for selecting the weights. For early fusion, we use a 12-layer, 12-head transformer initialized with the pretrained ViLT-B weights Kim et al. (2021). For late fusion, we employ a 2-layer, 8-head transformer and randomly initialize it.