DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents
Authors: Kushagra Pandey, Avideep Mukherjee, Piyush Rai, Abhishek Kumar
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now investigate several properties of the DiffuseVAE model. We use a mix of qualitative and quantitative evaluations for demonstrating these properties on several image synthesis benchmarks including CIFAR-10 (Krizhevsky, 2009), CelebA-64 (Liu et al., 2015), CelebA-HQ (Karras et al., 2018) and LHQ-256 (Skorokhodov et al., 2021) datasets. For quantitative evaluations involving sample quality, we use the FID (Heusel et al., 2018) metric. |
| Researcher Affiliation | Collaboration | Kushagra Pandey EMAIL Department of Computer Science University of California, Irvine. Avideep Mukherjee EMAIL Department of Computer Science Indian Institute of Technology, Kanpur. Piyush Rai EMAIL Department of Computer Science Indian Institute of Technology, Kanpur. Abhishek Kumar EMAIL Google Research, Brain Team |
| Pseudocode | Yes | Algorithm 1 DDPM Training (Form. 2). Algorithm 2 DDPM Inference (Form. 2) |
| Open Source Code | Yes | For reproducibility, our source code is publicly available at https://github.com/kpandey008/DiffuseVAE. |
| Open Datasets | Yes | We use a mix of qualitative and quantitative evaluations for demonstrating these properties on several image synthesis benchmarks including CIFAR-10 (Krizhevsky, 2009), CelebA-64 (Liu et al., 2015), CelebA-HQ (Karras et al., 2018) and LHQ-256 (Skorokhodov et al., 2021) datasets. |
| Dataset Splits | Yes | For quantitative evaluations involving sample quality, we use the FID (Heusel et al., 2018) metric. We also report the Inception Score (IS) metric (Salimans et al., 2016) for state-of-the-art comparisons on CIFAR-10. For all the experiments, we set the number of diffusion time-steps (T) to 1000 during training. The noise schedule in the DDPM forward process was set to a linear schedule between β1 = 10^-4 and β2 = 0.02 during training. More details regarding the model and training hyperparameters can be found in Appendix F. Some additional experimental results are presented in Appendix G. ... For CelebA-HQ 256 comparisons we computed FID scores on 30k samples since the CelebA-HQ dataset contains 30k images. |
| Hardware Specification | Yes | We used a mix of 4 Nvidia 1080Ti GPUs (44GB memory), a cloud TPUv2-8 (64GB memory) and a cloud TPUv3-8 (128GB memory) for training the models. Specifically, we used the GPU setup for training our CIFAR-10 and CelebA-64 models while we utilized the TPUv2-8 for training CelebA-HQ models at the 128 x 128 resolution. Finally, we utilized the TPUv3-8 for training CelebA-HQ and LHQ models at 256 x 256 resolution. |
| Software Dependencies | No | The paper mentions using 'U-Net...from (Nichol & Dhariwal, 2021)' and 'U-Net...from DDIM (Song et al., 2021a) (https://github.com/ermongroup/ddim/blob/main/models/diffusion.py)' and 'torch-fidelity (Obukhov et al., 2020) package' but does not provide specific version numbers for the programming language or other key libraries used. |
| Experiment Setup | Yes | All hyperparameter details related to VAE and DDPM training in DiffuseVAE are listed in Table 9. Moreover, all hyperparameters (model and training) were shared between both DiffuseVAE formulations. ... For all the experiments, we set the number of diffusion time-steps (T) to 1000 during training. The noise schedule in the DDPM forward process was set to a linear schedule between β1 = 10^-4 and β2 = 0.02 during training. ... Latent code size was set to 1024 for LHQ-256 and CelebA-HQ (both 128 and 256 resolution variants) and 512 for the CIFAR-10 and CelebA (64 x 64) datasets. ... Effective Batch Size 128 ... Optimizer Adam(lr=1e-4) ... KL-weight 1.0 ... Dropout 0.3 ... Noise Schedule (default) Linear(1e-4, 0.02) ... EMA decay rate 0.9999 ... Grad. Clip Threshold 1.0 ... # of lr annealing steps 5000 ... Diffusion loss type Noise prediction (L2). |
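The diffusion hyperparameters reported above (T = 1000, linear β schedule from 1e-4 to 0.02) can be sketched as follows. This is a minimal illustration of the stated noise schedule, not code from the authors' repository; the function name and NumPy usage are assumptions.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule as reported in the paper:
    T = 1000 steps, betas linearly spaced from 1e-4 to 0.02."""
    return np.linspace(beta_start, beta_end, T)

betas = linear_beta_schedule()
alphas = 1.0 - betas
# Cumulative products \bar{alpha}_t parameterize the forward process
# q(x_t | x_0) = N(sqrt(\bar{alpha}_t) x_0, (1 - \bar{alpha}_t) I).
alpha_bars = np.cumprod(alphas)
```

The `alpha_bars` sequence decreases monotonically toward zero, so samples at large t are dominated by noise, which is the behavior the DDPM forward process relies on.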