Aligning Multimodal Representations through an Information Bottleneck
Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, Alfonso Ortega
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The objectives of this experiment are to empirically validate the different statements made throughout the previous sections and to understand the relations between the different elements of our formulation. For this purpose, we use datasets typically employed in disentanglement-related tasks (Wang et al., 2024). Concretely, DSprites (Matthey et al., 2017), MPI3D (Gondal et al., 2019) and Shapes3D (Burgess & Kim, 2018) are used. These datasets contain images and labels that represent multiple independent factors of variation. We jointly train an image encoder and a factors encoder (i.e., images and factors are the two modalities). The reason to use these datasets is that we can control the number of factors that we input to the encoder, thus controlling the information imbalance between both modalities. Unless otherwise stated, a ResNet20 (He et al., 2016) is used as image encoder, an MLP as encoder for the factors, and the temperature in the InfoNCE loss is a trainable parameter initialized to 0.07. More details are given in Appendix C. [...] Table 1: URR (in percentages) for each dataset and category. [...] Table 2: URR (in percentages) for each dataset and category for the ViT-based encoder. [...] Figure 3: URR (y-axis) for different numbers of layers of the image neural encoder (x-axis). [...] Figure 5: Alignment (y-axis) vs. ˆI(Zα; Nα) (x-axis). [...] Table 3: CIDEr (Vedantam et al., 2015), BLEU@4 (Papineni et al., 2002) and retrieval accuracy for Q-Formers trained with different loss functions. |
| Researcher Affiliation | Collaboration | 1ViVo Lab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain 2University of Cambridge, Cambridge, UK 3Mitsubishi Electric Research Laboratories (MERL), Cambridge, USA 4Université de Toulon, Aix Marseille Univ, CNRS, LIS, Toulon, France. Correspondence to: Antonio Almudévar <EMAIL>. |
| Pseudocode | No | The paper describes methods and derivations in prose and mathematical notation but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Full implementation details and code are available at: https://github.com/antonioalmudevar/multimodal_ib. |
| Open Datasets | Yes | Concretely, DSprites (Matthey et al., 2017), MPI3D (Gondal et al., 2019) and Shapes3D (Burgess & Kim, 2018) are used. [...] COCO (Lin et al., 2014) is used to train and test our model. |
| Dataset Splits | No | The paper mentions generating scenarios and using datasets for training and testing, but it does not specify exact training/validation/test splits (e.g., percentages or sample counts) for any of the datasets used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions models like ResNet20, Vision Transformer (ViT-g/14), and BERT-base, but it does not specify software dependencies such as programming language versions or library versions (e.g., Python 3.x, PyTorch x.x, CUDA x.x). |
| Experiment Setup | Yes | Table 4 (hyperparameters of Section 5): factors encoder is an MLP with hidden sizes {16, 16, 8, 8, 16, 16} for DSprites, {128, 128, 64, 64, 128, 128} for MPI3D, and {64, 64, 32, 32, 64, 64} for Shapes3D; for all three datasets: 50 epochs, batch size 128, Adam optimizer, learning rate 0.001, Step scheduler with step size 20 epochs and γ = 0.3. [...] Table 6 (hyperparameters of Section 6): vision encoder ViT-g/14 (Fang et al., 2023); image size 224; 32 query tokens; cross-attention frequency 2; representation dimension 256; text encoder BERT-base (Devlin, 2018); batch size 128; Adam optimizer with learning rate 0.0001 and β = (0.9, 0.999); cosine annealing scheduler with 1000 warm-up steps; 50000 training steps. |
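The contrastive objective reported above (an InfoNCE loss whose temperature is a trainable parameter initialized to 0.07) can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name, the symmetric two-direction formulation, and the cosine-similarity logits are assumptions on our part, consistent with common InfoNCE usage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfoNCE(nn.Module):
    """Symmetric InfoNCE loss with a trainable temperature (init 0.07)."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Parameterize the log-temperature so the temperature stays positive.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, z_img: torch.Tensor, z_fac: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity logits between the two modalities, scaled by 1/T.
        z_img = F.normalize(z_img, dim=-1)
        z_fac = F.normalize(z_fac, dim=-1)
        logits = z_img @ z_fac.t() / self.log_temp.exp()
        # Matched pairs sit on the diagonal of the similarity matrix.
        targets = torch.arange(z_img.size(0), device=z_img.device)
        # Contrast in both directions: image->factors and factors->image.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```

Because the temperature is a `nn.Parameter`, it is updated by the same Adam optimizer as the encoders, matching the "trainable parameter initialized to 0.07" description.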
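The Table 4 optimization recipe (Adam at learning rate 0.001, Step scheduler with step size 20 epochs and γ = 0.3, 50 epochs, batch size 128) maps directly onto standard PyTorch components. The sketch below only encodes those reported settings; the `model` is a stand-in for the actual image and factors encoders, and the training loop body is elided.

```python
import torch

# Stand-in module; the paper trains a ResNet20 image encoder and an MLP
# factors encoder (see the Experiment Setup row above).
model = torch.nn.Linear(8, 8)

# Settings taken from Table 4: Adam, lr 0.001, StepLR(step_size=20, gamma=0.3).
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.3)

for epoch in range(50):  # 50 epochs per Table 4
    # ... one training epoch (forward, loss, backward, optimizer.step) ...
    optimizer.step()     # placeholder so the scheduler follows a real step
    scheduler.step()     # decays the lr by 0.3 every 20 epochs
```

With these settings the learning rate decays twice over training, at epochs 20 and 40, ending at 0.001 × 0.3² = 9e-5.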