Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders

Authors: Rogelio A. Mancisidor, Robert Jenssen, Shujian Yu, Michael Kampffmeyer

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The empirical analysis in this research shows that, when dependence between experts is considered, CoDE-VAE exhibits better performance in terms of balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations. Furthermore, CoDE-VAE minimizes the generative quality gap as the number of modalities increases, achieving unconditional FID scores similar to unimodal VAEs, which is a desirable property that is lacking in most current models. Finally, CoDE-VAE achieves a classification accuracy that is comparable to that of current state-of-the-art multimodal VAEs.
Researcher Affiliation | Academia | (1) Department of Data Science, BI Norwegian Business School, Oslo, Norway; (2) Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø, Norway; (3) Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; (4) Dept. BAMJO, Norwegian Computing Center, Oslo, Norway; (5) Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, Netherlands.
Pseudocode | Yes | Algorithm 1: Minibatch version of the CoDE-VAE algorithm.
Open Source Code | Yes | CoDE-VAE is available at: https://github.com/rogelioamancisidor/codevae.
Open Datasets | Yes | Following the standard experimental setup in this domain (Sutter et al., 2021; Daunhawer et al., 2022; Palumbo et al., 2023), performance is evaluated using the following multimodal datasets: the MNIST-SVHN-Text dataset (Sutter et al., 2020), composed of matching MNIST and SVHN digits and a text describing the digit; PolyMNIST (Sutter et al., 2021), composed of 5 MNIST images of the same digit but with different backgrounds and handwriting styles; and the Caltech Birds (CUB) dataset (Wah et al., 2011; Daunhawer et al., 2022), which is composed of images of birds paired with captions describing each bird.
Dataset Splits | Yes | The triples are created in a many-to-many mapping; therefore, there are 1,121,360 and 200,000 observations in the train and test sets, respectively. [...] Training and test sets have 60,000 and 10,000 images, respectively. [...] there are in total 117,880 pairs of image-captions, where 88,550 are used for model training and 29,330 for testing. [...] The train and test sets have 162,560 and 19,712 observations, respectively.
Hardware Specification | Yes | Models are trained on single A100 GPUs with AMD EPYC Milan processors with 24 cores. [...] Finally, we acknowledge Sigma2 (Norway) for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through project no. NN10040K.
Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2017), the LogisticRegression class in sklearn, a TensorFlow pre-trained Inception network, and the cv2 library, but it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | We train our CoDE-VAE model with the Adam optimizer with default values and a learning rate of 0.001, using mixed precision to speed up model training. Both image modalities are assumed to have Laplace likelihoods, whereas the text modality is assumed to have a categorical likelihood. The dimension of the latent space is set to 20, as in (Sutter et al., 2020; 2021; Palumbo et al., 2023; Mancisidor et al., 2024). [...] The β value is found by cross-validation using the values [0.1, 1, 5, 10, 15, 20]. We also consider β = 2.5 in the PolyMNIST data to replicate the setting in (Palumbo et al., 2023). For all datasets, we assume that the prior distribution is an isotropic Gaussian distribution, and the expert distributions are assumed to be multivariate Gaussian distributions with diagonal covariance matrices.
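The setup above (diagonal-Gaussian expert posteriors, an isotropic standard-normal prior, and a cross-validated β weight) implies a closed-form β-weighted KL regularizer. The sketch below is a minimal NumPy illustration of that term only, not the paper's full CoDE-VAE objective; the batch size, random inputs, and function name are illustrative assumptions.

```python
import numpy as np

def kl_diag_gaussian_vs_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.

    Closed form for a diagonal-Gaussian posterior against an isotropic
    standard-normal prior, as assumed in the experiment setup above.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Latent dimension 20, matching the setup quoted above.
latent_dim = 20
rng = np.random.default_rng(0)

# Illustrative batch of 8 expert-posterior parameters (not from the paper).
mu = rng.normal(size=(8, latent_dim))
log_var = rng.normal(scale=0.1, size=(8, latent_dim))

beta = 2.5  # one of the beta values considered for PolyMNIST
weighted_kl = beta * kl_diag_gaussian_vs_standard_normal(mu, log_var)
```

Because e^x ≥ 1 + x for all x, each per-dimension term is non-negative, so the regularizer is zero exactly when the posterior equals the prior (μ = 0, log σ² = 0).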