Aligning Multimodal Representations through an Information Bottleneck

Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, Alfonso Ortega

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The objectives of this experiment are to empirically validate the different statements made throughout the previous sections and understand the relations between the different elements of our formulation. For this purpose, we use some datasets typically employed in disentanglement-related tasks (Wang et al., 2024). Concretely, DSprites (Matthey et al., 2017), MPI3D (Gondal et al., 2019) and Shapes3D (Burgess & Kim, 2018) are used. These datasets contain images and labels that represent multiple independent factors of variation. We jointly train an image encoder and a factors encoder (i.e., images and factors are the two modalities). The reason to use these datasets is that we can control the amount of factors that we input to the encoder, thus controlling the information imbalance between both modalities. Unless otherwise stated, a ResNet20 (He et al., 2016) is used as image encoder, an MLP as encoder for the factors, and the temperature in the InfoNCE loss is a trainable parameter initialized to 0.07. More details are given in Appendix C. [...] Table 1: URR (in percentages) for each dataset and category. [...] Table 2: URR (in percentages) for each dataset and category for ViT-based encoder. [...] Figure 3: URR (y-axis) for different numbers of layers of the image neural encoder (x-axis). [...] Figure 5: Alignment (y-axis) vs. Î(Zα; Nα) (x-axis). [...] Table 3: CIDEr (Vedantam et al., 2015), BLEU@4 (Papineni et al., 2002) and retrieval accuracy for Q-Formers trained with different loss functions.
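The quoted setup trains two modality encoders with an InfoNCE loss whose temperature is a trainable parameter initialized to 0.07. The following is a minimal PyTorch sketch of such a symmetric InfoNCE objective; it is not the authors' code, and the class and variable names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCELoss(nn.Module):
    """Symmetric InfoNCE between two modalities with a trainable temperature."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Learn the log-temperature so the temperature stays positive.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        # L2-normalize both modalities' representations.
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        # Pairwise cosine similarities scaled by the learned temperature.
        logits = z_a @ z_b.t() / self.log_temp.exp()
        # The i-th sample in each batch is the positive for the i-th in the other.
        targets = torch.arange(z_a.size(0), device=z_a.device)
        # Average the two contrastive directions (a -> b and b -> a).
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```

Because `log_temp` is an `nn.Parameter`, the optimizer updates the temperature jointly with the encoder weights, as described in the quoted excerpt.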
Researcher Affiliation | Collaboration | 1ViVo Lab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain; 2University of Cambridge, Cambridge, UK; 3Mitsubishi Electric Research Laboratories (MERL), Cambridge, USA; 4Université de Toulon, Aix Marseille Univ, CNRS, LIS, Toulon, France. Correspondence to: Antonio Almudévar <EMAIL>.
Pseudocode | No | The paper describes its methods and derivations in prose and mathematical notation but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Full implementation details and code are available at: https://github.com/antonioalmudevar/multimodal_ib.
Open Datasets | Yes | Concretely, DSprites (Matthey et al., 2017), MPI3D (Gondal et al., 2019) and Shapes3D (Burgess & Kim, 2018) are used. [...] COCO (Lin et al., 2014) is used to train and test our model.
Dataset Splits | No | The paper mentions generating scenarios and using datasets for training and testing, but it does not specify exact training/validation/test splits (e.g., percentages or sample counts) for any of the datasets used.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions models like ResNet20, Vision Transformer (ViT-g/14), and BERT-base, but it does not specify software dependencies such as programming-language versions or library versions (e.g., Python 3.x, PyTorch x.x, CUDA x.x).
Experiment Setup | Yes | Table 4: Hyperparameters of Section 5 — factors encoder: MLP with hidden sizes {16, 16, 8, 8, 16, 16} (DSprites), {128, 128, 64, 64, 128, 128} (MPI3D), {64, 64, 32, 32, 64, 64} (Shapes3D); number of epochs: 50; batch size: 128; optimizer: Adam; learning rate: 0.001; scheduler: Step with step size 20 epochs and γ = 0.3 (identical across all three datasets). [...] Table 6: Hyperparameters of Section 6 — vision encoder: ViT-g/14 (Fang et al., 2023); image size: 224; number of query tokens: 32; cross-attention frequency: 2; representation dimension: 256; text encoder: BERT-base (Devlin, 2018); batch size: 128; optimizer: Adam; learning rate: 0.0001; optimizer β: (0.9, 0.999); scheduler: cosine annealing; warm-up steps: 1000; training steps: 50000.
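The Section 5 optimization settings quoted above (Adam, learning rate 0.001, step scheduler with step size 20 and γ = 0.3, 50 epochs) map directly onto standard PyTorch components. A minimal sketch follows; the `model` stand-in is hypothetical (the paper uses a ResNet20 image encoder and an MLP factors encoder), and the training loop body is elided.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the paper's encoders (ResNet20 / MLP).
model = nn.Linear(64, 32)

# Table 4 settings: Adam, lr = 0.001, StepLR(step_size=20, gamma=0.3).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.3)

for epoch in range(50):  # 50 epochs; the paper uses batch size 128.
    # ... one pass over the training batches would go here ...
    scheduler.step()  # decay lr by gamma every 20 epochs
```

With these settings the learning rate is decayed twice over the 50 epochs (at epochs 20 and 40), ending at 0.001 × 0.3² = 9e-5.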