Multimodal Variational Autoencoder: A Barycentric View

Authors: Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Jin Yang, Xiaotong Sun, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies on three multimodal benchmark datasets demonstrated the effectiveness of the proposed method compared to other state-of-the-art methods.
Researcher Affiliation | Academia | (1) Washington University in St. Louis; (2) Arizona State University; (3) Clemson University; (4) University of Arkansas
Pseudocode | No | The paper contains mathematical formulations, lemmas, and theorems, but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states only "For more implementation details (e.g., hyperparameter configurations), we kindly direct the readers to Appendix B." There is no explicit statement about code release or a link to a repository.
Open Datasets | Yes | Comparative experiments were conducted on three multimodal benchmark datasets: (i) PolyMNIST with five simplified modalities, (ii) the trimodal MNIST-SVHN-TEXT, and (iii) the challenging bimodal CelebA dataset. PolyMNIST was generated by combining each MNIST digit (LeCun and Cortes 2010) with 28×28 random crops from five distinct background images, as described in Sutter, Daunhawer, and Vogt (2021). MNIST-SVHN-TEXT was introduced by Sutter, Daunhawer, and Vogt (2020) and consists of three modalities: MNIST digits (LeCun and Cortes 2010), text, and SVHN (Netzer et al. 2011). The bimodal CelebA includes human face images as well as text describing the face attributes (Liu et al. 2015).
Dataset Splits | No | The paper refers to a "test set" in the evaluation-metrics section and mentions that "20 triples were generated per set" for MNIST-SVHN-TEXT, but it does not provide explicit percentages, counts, or methodology for the training, validation, and test splits of any of the datasets used.
Hardware Specification | Yes | All experiments were performed on an NVIDIA A100 GPU with 40 GB of memory.
Software Dependencies | No | The paper notes: "For a fair comparison, we followed the experimental settings in previous literature (Shi et al. 2019; Sutter, Daunhawer, and Vogt 2021). In particular, we employed the same network architecture as in (Shi et al. 2019; Sutter, Daunhawer, and Vogt 2021)." It does not provide specific software dependencies with version numbers.
Experiment Setup | No | The main text defers hyperparameter configurations to an appendix ("For more implementation details (e.g., hyperparameter configurations), we kindly direct the readers to Appendix B.") and refers to previous literature for experimental settings and network architecture, but does not state them explicitly within the main body.
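The PolyMNIST construction quoted in the Open Datasets row (MNIST digits composited over random 28×28 crops of five background images, one per modality) can be sketched as below. This is a hypothetical illustration of the procedure described in Sutter, Daunhawer, and Vogt (2021), not their exact implementation; the `make_polymnist_sample` helper, the blending rule, and the array shapes are assumptions for the sketch.

```python
import numpy as np


def make_polymnist_sample(digit, backgrounds, rng):
    """Composite one 28x28 MNIST digit over a random crop of each background.

    Hypothetical sketch: `digit` is a (28, 28) grayscale array in [0, 1];
    `backgrounds` is a list of larger (H, W, 3) RGB arrays in [0, 1], one per
    modality. Returns five 28x28x3 images that share the same digit label.
    """
    modalities = []
    for bg in backgrounds:
        h, w = bg.shape[:2]
        # Sample the top-left corner of a random 28x28 crop.
        y = rng.integers(0, h - 28 + 1)
        x = rng.integers(0, w - 28 + 1)
        crop = bg[y:y + 28, x:x + 28].astype(np.float32)
        # Overlay the digit: where the digit is bright, invert the crop so
        # the digit stays visible against any background (assumed blending).
        mask = digit[..., None]  # (28, 28, 1), broadcasts over RGB channels
        modalities.append((1 - mask) * crop + mask * (1 - crop))
    return modalities


rng = np.random.default_rng(0)
digit = rng.random((28, 28))  # stand-in for a real MNIST digit
backgrounds = [rng.random((64, 64, 3)) for _ in range(5)]
sample = make_polymnist_sample(digit, backgrounds, rng)
```

Each modality differs only in its background, so the shared digit label is the common latent content the multimodal VAE must recover across modalities.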