Multimodal Variational Autoencoder: A Barycentric View
Authors: Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Jin Yang, Xiaotong Sun, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies on three multimodal benchmark datasets demonstrated the effectiveness of the proposed method compared to other state-of-the-art methods. |
| Researcher Affiliation | Academia | 1 Washington University in St. Louis; 2 Arizona State University; 3 Clemson University; 4 University of Arkansas |
| Pseudocode | No | The paper contains mathematical formulations, lemmas, and theorems, but no explicitly labeled pseudocode or algorithm blocks are present. |
| Open Source Code | No | For more implementation details (e.g., hyperparameter configurations), we kindly direct the readers to Appendix B. There is no explicit statement about code release or a link to a repository. |
| Open Datasets | Yes | We conducted comparative experiments on three multimodal benchmark datasets: i) PolyMNIST with five simplified modalities, ii) the trimodal MNIST-SVHN-TEXT, and iii) the challenging bimodal CelebA dataset. PolyMNIST was generated by combining each MNIST digit (LeCun and Cortes 2010) with 28×28 random crops from five distinct background images, as described in (Sutter, Daunhawer, and Vogt 2021). The MNIST-SVHN-TEXT dataset was introduced by (Sutter, Daunhawer, and Vogt 2020) and consists of three modalities: MNIST digits (LeCun and Cortes 2010), text, and SVHN (Netzer et al. 2011). The bimodal CelebA dataset includes human face images as well as text describing the face attributes (Liu et al. 2015). |
| Dataset Splits | No | The paper refers to a "test set" in the evaluation metrics section and mentions that "20 triples were generated per set" for MNIST-SVHN-TEXT, but it does not provide explicit percentages, counts, or methodology for the training, validation, and test splits of any dataset used. |
| Hardware Specification | Yes | All experiments were performed on a Nvidia-A100 GPU with 40G memory. |
| Software Dependencies | No | For a fair comparison, we followed the experimental settings in previous literature (Shi et al. 2019; Sutter, Daunhawer, and Vogt 2021). In particular, we employed the same network architecture as in (Shi et al. 2019; Sutter, Daunhawer, and Vogt 2021). The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | No | For more implementation details (e.g., hyperparameter configurations), we kindly direct the readers to Appendix B. The main text defers hyperparameter configurations to an appendix and refers to prior literature for experimental settings and network architecture, but does not state them explicitly in the main body. |