What to align in multimodal contrastive learning?

Authors: Benoit Dufumier, Javiera Castillo Navarro, Devis Tuia, Jean-Philippe Thiran

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test CoMM both in a controlled setting and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities; in the latter, we show that CoMM learns complex multimodal interactions and achieves state-of-the-art results on seven multimodal tasks.
Researcher Affiliation | Academia | Benoit Dufumier (1,2), Javiera Castillo Navarro (1,3), Devis Tuia (1), Jean-Philippe Thiran (1,4). 1: EPFL; 2: NeuroSpin, CEA; 3: CEDRIC, CNAM; 4: Radiology Department, CHUV. Contact information: {name.surname}@epfl.ch
Pseudocode | Yes | Algorithm 1 presents the pseudo-code for CoMM's training. It is written for the general case of n modalities and is complementary to Fig. 3 (main paper), which depicts the case n = 2. CoMM's official implementation is available in this GitHub repository.
Open Source Code | Yes | Code is available here.
Open Datasets | Yes | To ensure the reproducibility of our work, we have used publicly available datasets from MultiBench (Liang et al., 2021) (MIMIC, MOSI, UR-FUNNY, MUStARD and Vision&Touch); MM-IMDb (Arevalo et al., 2017); and the synthetic Trifeatures dataset (Hermann & Lampinen, 2020).
Dataset Splits | Yes | To generate our Trifeature dataset, we considered the 1 000 combinations of the three features in the original dataset (see Appendix E) and split them into 800 combinations for training and 200 for evaluation. To add variety to the training set, each training combination was generated 3 times (with the shape and texture randomly rotated), giving a training split of 2 400 images. The bimodal Trifeature dataset used in our experiments was built by taking the Trifeature dataset twice (as two separate modalities) and forming pairs across the two copies. In total, we get 5 760 000 pairs (2 400 × 2 400) available for training, and 40 000 (200 × 200) available for evaluation.
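The split arithmetic quoted above can be checked in a few lines. This is a minimal sketch; all counts come directly from the quoted description, and the variable names are illustrative, not from the paper's code.

```python
# Split arithmetic for the bimodal Trifeature dataset, as described above.
n_combinations = 1_000                    # feature combinations in Trifeature
n_train_comb, n_eval_comb = 800, 200      # combination-level train/eval split
repeats = 3                               # each training combination rendered 3x

n_train_images = n_train_comb * repeats   # 2 400 training images
n_eval_images = n_eval_comb               # 200 evaluation images

# Pairs are formed across two copies of the dataset (one per modality),
# so the number of available pairs is the square of the image count.
n_train_pairs = n_train_images ** 2       # 5 760 000 training pairs
n_eval_pairs = n_eval_images ** 2         # 40 000 evaluation pairs
```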
Hardware Specification | Yes | All experiments ran on a single V100 GPU with 32 GB of memory.
Software Dependencies | No | For raw images (in Trifeature, MM-IMDb and Vision&Touch), we use the default SimCLR augmentations (Chen et al., 2020a), which include RandomResizedCrop, ColorJitter, RandomGrayscale, GaussianBlur and RandomHorizontalFlip (from the PyTorch library).
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) in all experiments, with learning rate α = 3×10⁻⁴ for Trifeature (weight decay 10⁻⁴), α = 10⁻³ for MIMIC, MOSI, UR-FUNNY and MUStARD (weight decay 10⁻²), and α = 10⁻⁴ for MM-IMDb and Vision&Touch (weight decay 10⁻²). For MM-IMDb, we also use a cosine scheduler with final value 10⁻⁶ and a warmup over 10 epochs. All models were optimized for 100 epochs.
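The MM-IMDb schedule quoted above (10-epoch warmup, then cosine decay from 10⁻⁴ to a final value of 10⁻⁶ over 100 epochs) can be sketched as a plain function. The linear shape of the warmup is an assumption, since the row only says "a warmup over 10 epochs"; the function name and signature are illustrative.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=10,
                base_lr=1e-4, final_lr=1e-6):
    """Warmup followed by cosine decay, matching the MM-IMDb settings
    quoted above. The linear warmup shape is an assumption."""
    if epoch < warmup_epochs:
        # linear ramp up to base_lr over the warmup epochs
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay from base_lr down to final_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

For example, the rate peaks at base_lr once warmup ends and decays toward final_lr by the last epoch; in a PyTorch training loop the same shape is often obtained with a LambdaLR or SequentialLR scheduler.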