Aligning Multimodal Representations through an Information Bottleneck

Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, Alfonso Ortega

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The objectives of this experiment are to empirically validate the different statements made throughout the previous sections and understand the relations between the different elements of our formulation. For this purpose, we use some datasets typically employed in disentanglement-related tasks (Wang et al., 2024). Concretely, DSprites (Matthey et al., 2017), MPI3D (Gondal et al., 2019) and Shapes3D (Burgess & Kim, 2018) are used. These datasets contain images and labels that represent multiple independent factors of variation. We jointly train an image encoder and a factors encoder (i.e., images and factors are the two modalities). The reason to use these datasets is that we can control the amount of factors that we input to the encoder, thus controlling the information imbalance between both modalities. Unless otherwise stated, a ResNet20 (He et al., 2016) is used as image encoder, an MLP as encoder for the factors, and the temperature in the InfoNCE loss is a trainable parameter initialized to 0.07. More details are given in Appendix C. [...] Table 1: URR (in percentages) for each dataset and category. [...] Table 2: URR (in percentages) for each dataset and category for ViT-based encoder. [...] Figure 3: URR (y-axis) for different numbers of layers of the image neural encoder (x-axis). [...] Figure 5: Alignment (y-axis) vs. Î(Zα; Nα) (x-axis). [...] Table 3: CIDEr (Vedantam et al., 2015), BLEU@4 (Papineni et al., 2002) and retrieval accuracy for Q-Formers trained with different loss functions.
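The quoted setup trains two modality encoders with an InfoNCE loss whose temperature is a trainable parameter initialized to 0.07. The following is a minimal PyTorch sketch of such a symmetric InfoNCE objective; it is not the authors' code, and the class and variable names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCELoss(nn.Module):
    """Symmetric InfoNCE between two modalities with a trainable temperature."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Learn the log-temperature so the temperature stays positive.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        # L2-normalize both modalities' representations.
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        # Pairwise cosine similarities scaled by the learned temperature.
        logits = z_a @ z_b.t() / self.log_temp.exp()
        # The i-th sample in each batch is the positive for the i-th in the other.
        targets = torch.arange(z_a.size(0), device=z_a.device)
        # Average the two contrastive directions (a -> b and b -> a).
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```

Because `log_temp` is an `nn.Parameter`, the optimizer updates the temperature jointly with the encoder weights, as described in the quoted excerpt.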
Researcher Affiliation | Collaboration | 1ViVo Lab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain; 2University of Cambridge, Cambridge, UK; 3Mitsubishi Electric Research Laboratories (MERL), Cambridge, USA; 4Université de Toulon, Aix Marseille Univ, CNRS, LIS, Toulon, France. Correspondence to: Antonio Almudévar <EMAIL>.
Pseudocode | No | The paper describes its methods and derivations in prose and mathematical notation but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Full implementation details and code are available at: https://github.com/antonioalmudevar/multimodal_ib.
Open Datasets | Yes | Concretely, DSprites (Matthey et al., 2017), MPI3D (Gondal et al., 2019) and Shapes3D (Burgess & Kim, 2018) are used. [...] COCO (Lin et al., 2014) is used to train and test our model.
Dataset Splits | No | The paper mentions generating scenarios and using datasets for training and testing, but it does not specify exact training/validation/test splits (e.g., percentages or sample counts) for any of the datasets used.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions models like ResNet20, Vision Transformer (ViT-g/14), and BERT-base, but it does not specify software dependencies such as programming-language versions or library versions (e.g., Python 3.x, PyTorch x.x, CUDA x.x).
Experiment Setup | Yes | Table 4: Hyperparameters of Section 5 — factors encoder: MLP with hidden sizes {16, 16, 8, 8, 16, 16} (DSprites), {128, 128, 64, 64, 128, 128} (MPI3D), {64, 64, 32, 32, 64, 64} (Shapes3D); number of epochs: 50; batch size: 128; optimizer: Adam; learning rate: 0.001; scheduler: Step with step size 20 epochs and γ = 0.3 (identical across all three datasets). [...] Table 6: Hyperparameters of Section 6 — vision encoder: ViT-g/14 (Fang et al., 2023); image size: 224; number of query tokens: 32; cross-attention frequency: 2; representation dimension: 256; text encoder: BERT-base (Devlin, 2018); batch size: 128; optimizer: Adam; learning rate: 0.0001; optimizer β: (0.9, 0.999); scheduler: cosine annealing; warm-up steps: 1000; training steps: 50000.
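The Section 5 optimization settings quoted above (Adam, learning rate 0.001, step scheduler with step size 20 and γ = 0.3, 50 epochs) map directly onto standard PyTorch components. A minimal sketch follows; the `model` stand-in is hypothetical (the paper uses a ResNet20 image encoder and an MLP factors encoder), and the training loop body is elided.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the paper's encoders (ResNet20 / MLP).
model = nn.Linear(64, 32)

# Table 4 settings: Adam, lr = 0.001, StepLR(step_size=20, gamma=0.3).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.3)

for epoch in range(50):  # 50 epochs; the paper uses batch size 128.
    # ... one pass over the training batches would go here ...
    scheduler.step()  # decay lr by gamma every 20 epochs
```

With these settings the learning rate is decayed twice over the 50 epochs (at epochs 20 and 40), ending at 0.001 × 0.3² = 9e-5.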