HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.
Researcher Affiliation | Industry | Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji (Sony AI, Tokyo, Japan; Sony Group Corporation, Tokyo, Japan)
Pseudocode | No | The paper describes the model architecture and mathematical formulations in prose and equations, but it does not contain any explicit pseudocode blocks, algorithm figures, or similarly structured step-by-step procedures.
Open Source Code | Yes | The source code is attached in the supplementary material.
Open Datasets | Yes | We comprehensively examine SQ-VAE-2 and RSQ-VAE and their applicability to generative modeling. In particular, we compare SQ-VAE-2 with VQ-VAE-2 and RSQ-VAE with RQ-VAE to see if our framework improves reconstruction performance relative to the baselines. ... In addition, we test HQ-VAE on an audio dataset to see if it is applicable to a different modality. ... CIFAR10 (Krizhevsky et al., 2009) contains ten classes of 32×32 color images... ... CelebA-HQ (Karras et al., 2018) contains 30,000 high-resolution face images... ... ImageNet (Deng et al., 2009) contains 1000 classes of natural images... ... FFHQ (Karras et al., 2019) contains 70,000 high-resolution face images. ... UrbanSound8K (Salamon et al., 2014) contains 8,732 labeled audio clips...
Dataset Splits | Yes | CIFAR10 ...separated into 50,000 and 10,000 samples for the training and test sets, respectively. We use the default split and further randomly select 10,000 samples from the training set to prepare the validation set. CelebA-HQ ...We use the default training/validation/test split (24,183/2,993/2,824 samples). FFHQ ...In Section 5.1, we split the images into three sets: training (60,000 samples), validation (5,000 samples), and test (5,000 samples) sets. ...In Section 5.2, we follow the same preprocessing as in Lee et al. (2022a), wherein the images are split into training (60,000 samples) and validation (10,000 samples) sets... ImageNet ...We use the default training/validation/test split (1,281,167/50,000/100,000 samples). UrbanSound8K ...UrbanSound8K is divided into ten folds, and we use folds 1-8/9/10 as the training/validation/test split.
Hardware Specification | No | The paper mentions running experiments on "4 GPUs" in Section F.2.2, but does not specify the model or type of these GPUs (e.g., NVIDIA A100, RTX 3090) or any other specific hardware components.
Software Dependencies | No | The paper mentions using the "Adam optimizer" and refers to various models and techniques like "Gumbel-softmax", "HiFi-GAN vocoder", and "Muse". However, it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or other key software components that would be necessary for exact reproduction.
Experiment Setup | Yes | For all the experiments, we use the Adam optimizer with β1 = 0.9 and β2 = 0.9. Unless otherwise noted, we reduce the learning rate by half if the validation loss does not improve in the last three epochs. ... We set the learning rate to 0.001 and train all the models for a maximum of 100 epochs with a mini-batch size of 32. ... We train ImageNet and FFHQ for a maximum of 50 and 200 epochs with a mini-batch size of 512 and 128, respectively. ... We gradually reduce the temperature parameter of the Gumbel-softmax trick with a standard schedule τ = exp(−10⁻⁵ t) ... the balancing parameter β in Equations (7) and (15) is set to 0.25, and the weight decay in EMA for the codebook update is set to 0.99.
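The two numeric schedules quoted above (the Gumbel-softmax temperature annealing and the EMA codebook update with decay 0.99) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the step-indexed form of the schedule, and the elementwise EMA are assumptions based on the quoted hyperparameters.

```python
import math

def gumbel_softmax_temperature(step: int, rate: float = 1e-5) -> float:
    """Annealing schedule quoted in the paper: tau = exp(-rate * t).

    `step` is the training step t; `rate` is 10^-5 per the quoted setup.
    """
    return math.exp(-rate * step)

def ema_codebook_update(codebook, batch_stats, decay: float = 0.99):
    """Illustrative elementwise EMA update with the quoted decay of 0.99.

    Each codebook entry moves a small fraction (1 - decay) toward the
    statistic computed from the current mini-batch.
    """
    return [decay * c + (1.0 - decay) * s
            for c, s in zip(codebook, batch_stats)]
```

For example, the temperature starts at 1.0 at step 0 and decays to exp(−1) ≈ 0.368 after 100,000 steps under this schedule.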