Score-Based Multimodal Autoencoder

Authors: Daniel Wesego, Pedram Rooshenas

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study our proposed methods and selected baselines using an extended version of PolyMNIST (Sutter et al., 2021) as well as the high-dimensional CelebAMask-HQ (Lee et al., 2020) dataset. We compare our methods SBM-VAE and SBM-RAE... We evaluate all methods on both prediction coherence and generative quality. To measure coherence, we use a pre-trained classifier to extract the label of the generated output and compare it with the associated label of the observed modalities (Shi et al., 2019). The coherence of unconditional generation is evaluated by counting the number of consistent predicted labels from the pre-defined classifier. We also measure the generative quality of the generated modalities using the FID score (Heusel et al., 2017). All reported results are averaged over at least 3 runs; the standard deviation is shown as shading under the curves in each figure. Figure 3 shows the generated samples from the third modality given the rest.
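The coherence protocol quoted above (label agreement under a pre-trained classifier, for both conditional and unconditional generation) can be sketched in a few lines of NumPy. This is a minimal illustration of the metric only; the function names and the logits-based interface are assumptions, not the paper's implementation.

```python
import numpy as np

def coherence(classifier_logits, target_labels):
    """Conditional coherence: fraction of generated samples whose predicted
    label (argmax of a pre-trained classifier's logits) matches the label
    of the observed modalities."""
    predicted = np.argmax(classifier_logits, axis=1)
    return float(np.mean(predicted == np.asarray(target_labels)))

def unconditional_coherence(per_modality_labels):
    """Unconditional coherence: fraction of generated multimodal samples for
    which every modality receives the same predicted label.
    `per_modality_labels` has shape (num_modalities, num_samples)."""
    labels = np.asarray(per_modality_labels)
    return float(np.mean(np.all(labels == labels[0], axis=0)))
```

The generative-quality side of the evaluation (FID) is a standard off-the-shelf metric and is omitted here.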
Researcher Affiliation | Academia | Daniel Wesego, Department of Computer Science, University of Illinois Chicago; Pedram Rooshenas, Department of Computer Science, University of Illinois Chicago
Pseudocode | Yes | Algorithms 1 and 2 show the training and inference procedures we use. Algorithm 1 Training... Algorithm 2 Inference...
Open Source Code | Yes | The code can be found at https://github.com/rooshenasgroup/sbmae
Open Datasets | Yes | We study our proposed methods and selected baselines using an extended version of PolyMNIST (Sutter et al., 2021) as well as the high-dimensional CelebAMask-HQ (Lee et al., 2020) dataset. ... We use the audio and image modalities from the MHD dataset by Vasco et al. (2022). ... We use partial samples, approximately 100K, from the SoundNet dataset (Aytar et al., 2016).
Dataset Splits | Yes | The extended PolyMNIST dataset was derived from the original PolyMNIST dataset of Sutter et al. (2020) with different background images and ten modalities. It has 50,000 training, 10,000 validation, and 10,000 test samples.
Hardware Specification | Yes | We use an A100 GPU for computing the time the models take.
Software Dependencies | No | No software dependencies with explicit version numbers are mentioned in the main text or appendix. The paper names specific algorithms, optimizers, and models (e.g., the Adam optimizer (Kingma & Ba, 2015), the Predictor-Corrector (PC) sampling algorithm (Song et al., 2020b), and the HiFi-GAN model) but gives no version numbers for their implementations or for other libraries.
Experiment Setup | Yes | Hyperparameters and neural network design are discussed in detail in Appendix A.2. ... The VAEs for each modality are trained with an initial learning rate of 0.001 using a β value of 0.1, where the prior, posterior, and likelihood are all Gaussians. ... We use a learning rate of 0.0002 with the Adam optimizer (Kingma & Ba, 2015). The detailed hyperparameters are shown in Table 4. We use the VPSDE with β0 = 0.1 and β1 = 5 with N = 100 and the PC sampling technique with Euler-Maruyama and Langevin dynamics. For fewer than 10 modalities, we use a β0 of 1; the other hyperparameters remain the same.
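The sampling setup quoted in this row (a linear VPSDE schedule with β0 = 0.1, β1 = 5, N = 100 discretization steps, and Predictor-Corrector sampling with an Euler-Maruyama predictor and a Langevin corrector) can be sketched as follows. This is a generic NumPy illustration of the technique, not the paper's code: `score_fn` stands in for a trained score network, and the `snr` value and stability cap are assumptions.

```python
import numpy as np

BETA_0, BETA_1, N = 0.1, 5.0, 100  # VPSDE schedule and step count from the quoted setup

def beta(t):
    # Linear VPSDE noise schedule: beta(t) = beta_0 + t * (beta_1 - beta_0), t in [0, 1]
    return BETA_0 + t * (BETA_1 - BETA_0)

def pc_sample(score_fn, shape, snr=0.16, rng=None):
    """Predictor-Corrector sampling: reverse-time Euler-Maruyama predictor
    followed by one Langevin dynamics corrector step per iteration."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)  # start from the VPSDE prior, approx. N(0, I)
    dt = -1.0 / N                   # integrate from t = 1 down to t = 0
    for i in range(N):
        t = 1.0 - i / N
        b = beta(t)
        # Predictor: Euler-Maruyama step of the reverse-time VPSDE
        drift = -0.5 * b * x - b * score_fn(x, t)
        x = x + drift * dt + np.sqrt(b * abs(dt)) * rng.standard_normal(shape)
        # Corrector: one Langevin step, sized by a signal-to-noise ratio heuristic
        grad = score_fn(x, t)
        grad_norm = np.linalg.norm(grad) / np.sqrt(x.size)
        eps = min(2 * (snr / max(grad_norm, 1e-8)) ** 2, 1.0)  # cap for stability
        x = x + eps * grad + np.sqrt(2 * eps) * rng.standard_normal(shape)
    return x
```

With `score_fn = lambda x, t: -x` (the score of a standard Gaussian), the loop simply resamples an approximately standard-normal variable; in the paper's setting the score network operates in the latent space of the modality-specific autoencoders.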