What to align in multimodal contrastive learning?

Authors: Benoit Dufumier, Javiera Castillo Navarro, Devis Tuia, Jean-Philippe Thiran

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test CoMM both in a controlled setting and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities; in the latter, we show that CoMM learns complex multimodal interactions and achieves state-of-the-art results on seven multimodal tasks.
Researcher Affiliation | Academia | Benoit Dufumier (1,2), Javiera Castillo Navarro (1,3), Devis Tuia (1), Jean-Philippe Thiran (1,4). 1: EPFL; 2: NeuroSpin, CEA; 3: CEDRIC, CNAM; 4: Radiology Department, CHUV. Contact information: {name.surname}@epfl.ch
Pseudocode | Yes | Algorithm 1 presents the pseudo-code for CoMM's training. It is written for the general case of n modalities and is complementary to Fig. 3 (main paper), which depicts the case n = 2. CoMM's official implementation is available in this GitHub repository.
Open Source Code | Yes | Code is available here.
Open Datasets | Yes | To ensure the reproducibility of our work, we have used publicly available datasets from MultiBench (Liang et al., 2021) (MIMIC, MOSI, UR-FUNNY, MUStARD and Vision&Touch); MM-IMDb (Arevalo et al., 2017); and the synthetic Trifeatures dataset (Hermann & Lampinen, 2020).
Dataset Splits | Yes | To generate our Trifeature dataset, we considered the 1 000 combinations of the three features in the original dataset (see Appendix E) and split them into 800 combinations for training and 200 for evaluation. To add variety to the training set, each training combination was generated 3 times (with the shape and texture randomly rotated), giving a training split of 2 400 images. The bimodal Trifeature dataset used in our experiments was built by taking the Trifeature dataset twice (as two separate modalities) and forming pairs across the two copies. In total, we get 5 760 000 pairs (2 400 × 2 400) available for training, and 40 000 (200 × 200) available for evaluation.
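The split arithmetic quoted above can be checked in a few lines. This is a minimal sketch; all counts come directly from the quoted description, and the variable names are illustrative, not from the paper's code.

```python
# Split arithmetic for the bimodal Trifeature dataset, as described above.
n_combinations = 1_000                    # feature combinations in Trifeature
n_train_comb, n_eval_comb = 800, 200      # combination-level train/eval split
repeats = 3                               # each training combination rendered 3x

n_train_images = n_train_comb * repeats   # 2 400 training images
n_eval_images = n_eval_comb               # 200 evaluation images

# Pairs are formed across two copies of the dataset (one per modality),
# so the number of available pairs is the square of the image count.
n_train_pairs = n_train_images ** 2       # 5 760 000 training pairs
n_eval_pairs = n_eval_images ** 2         # 40 000 evaluation pairs
```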
Hardware Specification | Yes | All experiments ran on a single V100 GPU with 32 GB of memory.
Software Dependencies | No | For raw images (in Trifeature, MM-IMDb and Vision&Touch), we use the default SimCLR augmentations (Chen et al., 2020a), which include RandomResizedCrop, ColorJitter, RandomGrayscale, GaussianBlur and RandomHorizontalFlip (from the PyTorch library).
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) in all experiments, with learning rate α = 3×10⁻⁴ for Trifeature (weight decay 10⁻⁴), α = 10⁻³ for MIMIC, MOSI, UR-FUNNY and MUStARD (weight decay 10⁻²), and α = 10⁻⁴ for MM-IMDb and Vision&Touch (weight decay 10⁻²). For MM-IMDb, we also use a cosine scheduler with final value 10⁻⁶ and a warmup over 10 epochs. All models were optimized for 100 epochs.
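The MM-IMDb schedule quoted above (10-epoch warmup, then cosine decay from 10⁻⁴ to a final value of 10⁻⁶ over 100 epochs) can be sketched as a plain function. The linear shape of the warmup is an assumption, since the row only says "a warmup over 10 epochs"; the function name and signature are illustrative.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=10,
                base_lr=1e-4, final_lr=1e-6):
    """Warmup followed by cosine decay, matching the MM-IMDb settings
    quoted above. The linear warmup shape is an assumption."""
    if epoch < warmup_epochs:
        # linear ramp up to base_lr over the warmup epochs
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay from base_lr down to final_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

For example, the rate peaks at base_lr once warmup ends and decays toward final_lr by the last epoch; in a PyTorch training loop the same shape is often obtained with a LambdaLR or SequentialLR scheduler.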