Epsilon-VAE: Denoising as Visual Decoding
Authors: Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. Our study systematically examines these components through controlled experiments, demonstrating their impact on achieving a high-performing diffusion-based autoencoder. Under the standard configuration (Rombach et al., 2022), our method obtains a 40% improvement in reconstruction quality, leading to 22% better image generation quality. More notably, we achieve 2.3× higher inference throughput by increasing compression rates while keeping competitive generation quality. We evaluate the effectiveness of ϵ-VAE on image reconstruction and generation tasks using ImageNet (Deng et al., 2009). The VAE formulation by Esser et al. (2021) serves as a strong baseline due to its widespread use in modern image generative models (Rombach et al., 2022; Peebles & Xie, 2023; Esser et al., 2024). We perform controlled experiments to compare reconstruction and generation quality by varying model scale, latent dimension, downsampling rates, and input resolution. |
| Researcher Affiliation | Collaboration | Long Zhao 1, Sanghyun Woo 1, Ziyu Wan 1,2, Yandong Li 1, Han Zhang 1, Boqing Gong 1, Hartwig Adam 1, Xuhui Jia 1, Ting Liu 1. 1 Google DeepMind, 2 City University of Hong Kong. Correspondence to: Long Zhao <EMAIL>, Xuhui Jia <EMAIL>, Ting Liu <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "All models are implemented in JAX/Flax (Bradbury et al., 2018; Heek et al., 2024) and trained on TPU-v5lite pods." However, it does not provide an explicit statement about releasing the code for their method or a link to a code repository. |
| Open Datasets | Yes | We evaluate the effectiveness of ϵ-VAE on image reconstruction and generation tasks using ImageNet (Deng et al., 2009). We evaluate rFID, PSNR and SSIM on the full validation sets of ImageNet and COCO-2017 (Lin et al., 2014). |
| Dataset Splits | Yes | We evaluate rFID, PSNR and SSIM on the full validation sets of ImageNet and COCO-2017 (Lin et al., 2014), with the results summarized in Tab. 2. |
| Hardware Specification | Yes | All models are implemented in JAX/Flax (Bradbury et al., 2018; Heek et al., 2024) and trained on TPU-v5lite pods. Inference throughputs are computed on a Tesla H100 GPU. |
| Software Dependencies | No | All models are implemented in JAX/Flax (Bradbury et al., 2018; Heek et al., 2024) and trained on TPU-v5lite pods. The paper mentions software frameworks but does not specify their version numbers. |
| Experiment Setup | Yes | The autoencoder loss follows Eq. 1, with weights set to λLPIPS = 0.5 and λadv = 0.5. We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0 and β2 = 0.999, applying a linear learning rate warmup over the first 5,000 steps, followed by a constant rate of 0.0001 for a total of one million steps. The batch size is 256, with data augmentations including random cropping and horizontal flipping. We follow the setting in Peebles & Xie (2023) to train the latent diffusion models for unconditional image generation on the ImageNet dataset. The DiT-XL/2 architecture is used for all experiments. The diffusion hyperparameters from ADM (Dhariwal & Nichol, 2021) are kept. To be specific, we use a t_max = 1000 linear variance schedule ranging from 0.0001 to 0.02, and results are generated using 250 DDPM sampling steps. All models are trained with Adam (Kingma & Ba, 2015) with no weight decay. We use a constant learning rate of 0.0001 and a batch size of 256. Horizontal flipping and random cropping are used for data augmentation. We maintain an exponential moving average of DiT weights over training with a decay of 0.9999. We use identical training hyperparameters across all experiments and train models for one million steps in total. No classifier-free guidance (Ho & Salimans, 2022) is employed in all the experiments. |
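The schedules quoted in the Experiment Setup row are standard and easy to reproduce. Below is a minimal sketch (NumPy, not the paper's JAX/Flax code) of the two pieces a re-implementation would need: the ADM-style linear variance (beta) schedule with t_max = 1000 ranging from 0.0001 to 0.02, and the learning rate with linear warmup over the first 5,000 steps followed by a constant 0.0001. Function names here are illustrative, not from the authors' codebase.

```python
import numpy as np

# Hyperparameters as quoted from the paper's experiment setup.
T_MAX = 1000                 # diffusion timesteps
BETA_START, BETA_END = 1e-4, 0.02
WARMUP_STEPS = 5_000         # linear LR warmup length
BASE_LR = 1e-4               # constant LR after warmup

def linear_beta_schedule(t_max=T_MAX, start=BETA_START, end=BETA_END):
    """Linear variance schedule beta_1..beta_T, as in ADM/DDPM."""
    return np.linspace(start, end, t_max)

def learning_rate(step, warmup=WARMUP_STEPS, base_lr=BASE_LR):
    """Linear warmup to base_lr over `warmup` steps, then constant."""
    return base_lr * min(1.0, step / warmup)

betas = linear_beta_schedule()
# Cumulative product of (1 - beta), used by DDPM sampling.
alphas_cumprod = np.cumprod(1.0 - betas)
```

In a JAX/Flax training loop this warmup would typically be expressed as an `optax` schedule passed to `optax.adam` (with b1 = 0, b2 = 0.999 per the quoted setup); the sketch above only verifies the numerical shapes.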