Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Song Han

ICLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512×512, our DC-AE provides 19.1× inference speedup and 17.9× training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder."
Researcher Affiliation: Collaboration. Junyu Chen (MIT, Tsinghua University), Han Cai (NVIDIA), Junsong Chen (NVIDIA), Enze Xie (NVIDIA), Shang Yang (MIT), Haotian Tang (MIT), Muyang Li (MIT), Song Han (MIT, NVIDIA).
Pseudocode: No. The paper includes figures describing architectural components and training pipelines (e.g., Figure 4, Figure 6, Figure 10) but no formal pseudocode or algorithm blocks.
Open Source Code: Yes. https://github.com/mit-han-lab/efficientvit
Open Datasets: Yes. "We use a mixture of datasets to train autoencoders (baselines and DC-AE), containing ImageNet (Deng et al., 2009), SAM (Kirillov et al., 2023), Mapillary Vistas (Neuhold et al., 2017), and FFHQ (Karras et al., 2019)."
Dataset Splits: Yes. "For ImageNet experiments, we exclusively use the ImageNet training split to train autoencoders and diffusion models."
Hardware Specification: Yes. "We profile the training and inference throughput on the H100 GPU with PyTorch and TensorRT respectively. The latency is measured on the 3090 GPU with batch size 2."
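The latency measurement described above can be sketched as a minimal wall-clock timing harness. This is a generic sketch (the function name and defaults are mine), not the paper's profiling code; for CUDA models one would additionally synchronize the device before each timestamp, since GPU kernels launch asynchronously.

```python
import time

def measure_latency_ms(fn, batch, n_warmup=10, n_iters=50):
    """Average wall-clock latency of fn(batch) in milliseconds.

    Hypothetical helper. For CUDA models, call torch.cuda.synchronize()
    before reading each timestamp so queued kernels are actually counted.
    """
    for _ in range(n_warmup):   # warm up caches, JIT, autotuned kernels
        fn(batch)
    t0 = time.perf_counter()
    for _ in range(n_iters):
        fn(batch)
    return (time.perf_counter() - t0) / n_iters * 1e3
```

With a batch of size 2, `measure_latency_ms(model, inputs)` would mirror the 3090 latency measurement quoted above.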
Software Dependencies: No. The paper mentions using PyTorch and TensorRT but does not specify version numbers for these software components. It also mentions the AdamW optimizer, again without a version.
Experiment Setup: Yes. "In phase 1 (low-resolution full training), we use a constant learning rate of 6.4e-5 with a weight decay of 0.1, and AdamW betas of (0.9, 0.999). We use L1 loss and LPIPS loss (Zhang et al., 2018). In phase 2 (high-resolution latent adaptation), we use a constant learning rate of 1.6e-5, a weight decay of 0.001, and AdamW betas of (0.9, 0.999). We use the same loss as phase 1. In phase 3 (low-resolution local refinement), we use a constant learning rate of 5.4e-5, and AdamW betas of (0.5, 0.9). We use L1 loss, LPIPS loss (Zhang et al., 2018), and PatchGAN loss (Isola et al., 2017). The SiT and USiT models are trained for 500k iterations with batch size 1024."
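The three-phase recipe above can be collected into a small configuration table. This is a hypothetical sketch (the phase and field names are mine); only the numeric values come from the quoted setup, and phase 3's weight decay, which the paper does not state, is left unset.

```python
from typing import Any, Dict

# Hypothetical summary of the three-phase DC-AE training recipe quoted above.
# Field names are illustrative; only the numbers come from the paper.
PHASES: Dict[str, Dict[str, Any]] = {
    "phase1_low_res_full_training": {
        "lr": 6.4e-5, "weight_decay": 0.1, "betas": (0.9, 0.999),
        "losses": ["L1", "LPIPS"],
    },
    "phase2_high_res_latent_adaptation": {
        "lr": 1.6e-5, "weight_decay": 0.001, "betas": (0.9, 0.999),
        "losses": ["L1", "LPIPS"],  # same losses as phase 1
    },
    "phase3_low_res_local_refinement": {
        "lr": 5.4e-5, "weight_decay": None,  # not specified in the paper
        "betas": (0.5, 0.9),
        "losses": ["L1", "LPIPS", "PatchGAN"],
    },
}

def adamw_kwargs(phase: str) -> Dict[str, Any]:
    """Keyword arguments one would pass to torch.optim.AdamW for a phase."""
    cfg = PHASES[phase]
    kwargs = {"lr": cfg["lr"], "betas": cfg["betas"]}
    if cfg["weight_decay"] is not None:
        kwargs["weight_decay"] = cfg["weight_decay"]
    return kwargs
```

For example, `adamw_kwargs("phase1_low_res_full_training")` returns `{"lr": 6.4e-5, "betas": (0.9, 0.999), "weight_decay": 0.1}`.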