Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512×512, our DC-AE provides 19.1× inference speedup and 17.9× training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. |
| Researcher Affiliation | Collaboration | Junyu Chen¹,², Han Cai³, Junsong Chen³, Enze Xie³, Shang Yang¹, Haotian Tang¹, Muyang Li¹, Song Han¹,³ (¹MIT, ²Tsinghua University, ³NVIDIA) |
| Pseudocode | No | The paper includes figures describing architectural components and training pipelines (e.g., Figure 4, Figure 6, Figure 10) but no formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/mit-han-lab/efficientvit |
| Open Datasets | Yes | We use a mixture of datasets to train autoencoders (baselines and DC-AE), containing ImageNet (Deng et al., 2009), SAM (Kirillov et al., 2023), Mapillary Vistas (Neuhold et al., 2017), and FFHQ (Karras et al., 2019). |
| Dataset Splits | Yes | For ImageNet experiments, we exclusively use the ImageNet training split to train autoencoders and diffusion models. |
| Hardware Specification | Yes | We profile the training and inference throughput on the H100 GPU with PyTorch and TensorRT respectively. The latency is measured on the 3090 GPU with batch size 2. |
| Software Dependencies | No | The paper mentions using PyTorch and TensorRT but does not specify version numbers for either. It also names the AdamW optimizer without further library details. |
| Experiment Setup | Yes | In phase 1 (low-resolution full training), we use a constant learning rate of 6.4e-5 with a weight decay of 0.1, and AdamW betas of (0.9, 0.999). We use L1 loss and LPIPS loss (Zhang et al., 2018). In phase 2 (high-resolution latent adaptation), we use a constant learning rate of 1.6e-5, a weight decay of 0.001, and AdamW betas of (0.9, 0.999). We use the same loss as phase 1. In phase 3 (low-resolution local refinement), we use a constant learning rate of 5.4e-5, and AdamW betas of (0.5, 0.9). We use L1 loss, LPIPS loss (Zhang et al., 2018), and PatchGAN loss (Isola et al., 2017). The SiT and USiT models are trained for 500k iterations with batch size 1024. |
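The three-phase setup quoted above can be collected into a configuration sketch. This is illustrative only: the dictionary structure, key names, and the `make_optimizer_kwargs` helper are assumptions, while the hyperparameter values are taken from the paper's text (the paper does not state a phase-3 weight decay, so it is left unset here).

```python
# Hypothetical config capturing the DC-AE three-phase training recipe
# described in the paper; field names are illustrative, values are quoted.
TRAINING_PHASES = {
    "phase1_low_res_full_training": {
        "lr": 6.4e-5,
        "weight_decay": 0.1,
        "adamw_betas": (0.9, 0.999),
        "losses": ["L1", "LPIPS"],
    },
    "phase2_high_res_latent_adaptation": {
        "lr": 1.6e-5,
        "weight_decay": 0.001,
        "adamw_betas": (0.9, 0.999),
        "losses": ["L1", "LPIPS"],
    },
    "phase3_low_res_local_refinement": {
        "lr": 5.4e-5,
        "weight_decay": None,  # not specified in the paper for phase 3
        "adamw_betas": (0.5, 0.9),
        "losses": ["L1", "LPIPS", "PatchGAN"],
    },
}


def make_optimizer_kwargs(phase: str) -> dict:
    """Build AdamW keyword arguments for the given training phase,
    omitting weight_decay when the paper does not report one."""
    cfg = TRAINING_PHASES[phase]
    kwargs = {"lr": cfg["lr"], "betas": cfg["adamw_betas"]}
    if cfg["weight_decay"] is not None:
        kwargs["weight_decay"] = cfg["weight_decay"]
    return kwargs
```

Such a table makes the phase-to-phase differences explicit: the learning rate drops for the latent-adaptation phase, and the GAN-refinement phase switches to the (0.5, 0.9) betas commonly used with adversarial losses.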