Understanding the Emergence of Multimodal Representation Alignment

Authors: Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. ... Through extensive experiments on controlled and real-world datasets with varying degrees of interactions and heterogeneity, we discover several key insights.
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, USA; 2 NTT Research, USA.
Pseudocode | No | No structured pseudocode or algorithm blocks are present in the main text.
Open Source Code | Yes | Code is released at: https://github.com/MeganTj/multimodal_alignment.
Open Datasets | Yes | We use the same dataset and models as Huh et al. (2024), which evaluates alignment on the Wikipedia caption dataset (Srinivasan et al., 2021) with naturally co-occurring text and images. ... We experiment with MultiBench (Liang et al., 2021), which collects a diverse range of real-world multimodal datasets: MOSEI (Bagher Zadeh et al., 2018), a dataset for predicting emotions from videos (vision, audio, language); MOSI (Zadeh et al., 2016), a dataset for predicting sentiment from videos (vision, audio, language); UR-FUNNY (Hasan et al., 2019), a humor detection dataset from videos (vision, audio, language); MUStARD (Castro et al., 2019), a sarcasm detection dataset from TV shows (vision, audio, language); and AVMNIST (Pérez-Rúa et al., 2019), a dataset for digit classification from paired images and spoken digits. Additionally, we experiment with MM-IMDb (Arevalo et al., 2017), a dataset for classifying movie genres from paired images and text.
Dataset Splits | Yes | C.3. MultiBench Dataset ... MUStARD (Castro et al., 2019) is a dataset ... There are 414, 138, and 138 video segments in the training, validation, and testing data, which gives a total of 690 data points. MOSI (Zadeh et al., 2016) is a dataset ... resulting in 1,284, 229, and 686 segments in the train, validation, and testing sets. UR-FUNNY (Hasan et al., 2019) is a large-scale dataset ... There are a total of 10,598, 2,626, and 3,290 segments in the train, validation, and testing sets. MOSEI (Bagher Zadeh et al., 2018) is a large-scale dataset ... There are a total of 16,265, 1,869, and 4,643 segments in the train, validation, and testing sets. ... AVMNIST (Pérez-Rúa et al., 2019) is a dataset ... resulting in 55,000, 5,000, and 10,000 examples in the train, validation, and test sets respectively. ... Section 6.2 ... we use a subset of 1,024 labeled examples for each of the train, validation, and test sets to simulate the scarce data scenario.
Hardware Specification | No | Acknowledgements: MT is supported by the National Science Foundation (NSF) under Grant No. 2141064. We acknowledge NVIDIA's GPU support. We thank Hengzhi Li and Minseok Jung for feedback and discussions. ... Explanation for No: The paper mentions "NVIDIA's GPU support" but does not specify exact GPU models or other hardware details used for the experiments.
Software Dependencies | No | The language model families considered are BLOOM (Workshop et al., 2023), OpenLLaMA (Geng & Liu, 2023), and LLaMA (Touvron et al., 2023), downloaded from Hugging Face (Wolf et al., 2020). The vision models are vision transformer models ... These models were downloaded from PyTorch Image Models (Wightman, 2019). ... We preprocess the raw FSDD audio into 36 MFCC coefficients with a maximum sequence length of 20 using librosa (McFee et al., 2015). ... Explanation for No: The paper mentions several software tools and frameworks (Hugging Face, PyTorch Image Models, librosa) and model families, but it does not provide specific version numbers for these or other critical software dependencies required for replication (e.g., Python, PyTorch, CUDA versions).
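The FSDD preprocessing quoted above produces 36 MFCC coefficients per frame with a maximum sequence length of 20. The actual extraction uses librosa (e.g., `librosa.feature.mfcc` with `n_mfcc=36`); the sketch below illustrates only the fixed-length pad/truncate step in pure Python. The function name `pad_or_truncate` is hypothetical and not from the paper's released code.

```python
# Hypothetical sketch of the fixed-length step in the MFCC preprocessing
# described above. The paper uses librosa for the MFCC extraction itself;
# here we only show how a variable-length clip could be clipped to a
# maximum sequence length of 20, zero-padding shorter clips.

N_MFCC = 36    # MFCC coefficients per frame (from the paper)
MAX_LEN = 20   # maximum sequence length (from the paper)

def pad_or_truncate(frames, max_len=MAX_LEN, n_mfcc=N_MFCC):
    """Clip a sequence of MFCC frames to max_len, zero-padding short clips.

    `frames` is a list of per-timestep vectors, each of length n_mfcc.
    """
    frames = frames[:max_len]
    padding = [[0.0] * n_mfcc for _ in range(max_len - len(frames))]
    return frames + padding

# A 5-frame clip is padded up to 20 frames; a 30-frame clip is truncated.
short_clip = pad_or_truncate([[1.0] * N_MFCC] * 5)
long_clip = pad_or_truncate([[1.0] * N_MFCC] * 30)
```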
Experiment Setup | Yes | A.1. Synthetic Data Experiments: On the synthetic dataset, we train MLPs with the AdamW optimizer, with the number of hidden dimensions kept the same as the number of input features, 12. For a given level of uniqueness, we choose suitable hyperparameters across different model depths and transformation depths. Specifically, we tune the learning rate in the range {1e-1, 1e-2, 1e-3, 1e-4} and weight decay in the range {0, 1e-1, 1e-2, 1e-3, 1e-4} for each modality. The depth-1 MLP for the untransformed modality was trained for 50 epochs, and the models for the transformed modality were trained for 300 epochs. We use a batch size of 512 for computing alignment. To ensure robustness, we report results with five different random seeds for each dataset. ...

A.3. MultiBench Experiments: ... We use the AdamW optimizer. For each dataset, we choose suitable hyperparameters across different model depths and tune the learning rate in the range {1e-3, 5e-4, 1e-4, 5e-5, 1e-5} and weight decay in the range {0, 1e-1, 1e-2, 1e-3, 1e-4}. ... To ensure robustness, we train each architecture across 3 different seeds. ...

A.4. MM-IMDb Experiments: To compute downstream performance, we train linear classifiers on top of the final hidden layer embeddings of the language model described in Appendix A.2 for 100 epochs. We tune the learning rate in the range {5e-3, 1e-3, 5e-4, 1e-4} and weight decay in the range {0, 1e-1, 1e-2, 1e-3, 1e-4}. For finetuning models trained with CLIP, we use a learning rate of 1e-4 and a cosine scheduler with a final value of 1e-6 and a warmup over 10 epochs. Models were optimized for 30 epochs.
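The tuning procedure quoted above amounts to a grid search over learning rate and weight decay, selecting the best configuration per modality or dataset. A minimal sketch, using the ranges from Appendix A.1; `train_and_eval` is a hypothetical stand-in for the paper's training-and-validation loop, not part of the released code:

```python
from itertools import product

# Hyperparameter ranges quoted from Appendix A.1 (synthetic-data experiments).
LEARNING_RATES = [1e-1, 1e-2, 1e-3, 1e-4]
WEIGHT_DECAYS = [0.0, 1e-1, 1e-2, 1e-3, 1e-4]

def grid_search(train_and_eval):
    """Exhaustively try every (lr, weight_decay) pair; return the best.

    `train_and_eval(lr, weight_decay)` is assumed to train a model and
    return a validation score (higher is better).
    """
    best_score, best_lr, best_wd = None, None, None
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        score = train_and_eval(lr=lr, weight_decay=wd)
        if best_score is None or score > best_score:
            best_score, best_lr, best_wd = score, lr, wd
    return best_score, best_lr, best_wd

# Toy usage with a dummy objective whose optimum is lr=1e-3, wd=1e-4.
score, lr, wd = grid_search(
    lambda lr, weight_decay: -abs(lr - 1e-3) - abs(weight_decay - 1e-4)
)
```

In the paper this search is repeated per modality (A.1) and per dataset (A.3), with results averaged over multiple random seeds.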