Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. Extensive experiments on ImageNet (Deng et al., 2009) demonstrate the effectiveness of MAETok.
Researcher Affiliation Collaboration 1Carnegie Mellon University, 2AMD, 3The University of Hong Kong, 4Peking University, 5William & Mary. Correspondence to: Hao Chen <EMAIL>.
Pseudocode No The paper describes the model architecture and training objectives using mathematical equations and descriptive text, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code and trained models are released: https://github.com/Hhhhhhao/continuous_tokenizer.
Open Datasets Yes Extensive experiments on ImageNet (Deng et al., 2009) demonstrate the effectiveness of MAETok... Three MAETok variants are trained on 256×256 ImageNet (Deng et al., 2009), 512×512 ImageNet, and a subset of 512×512 LAION-COCO (Schuhmann et al., 2022) for 500K iterations, respectively.
Dataset Splits Yes We report the reconstruction Fréchet inception distance (rFID) (Heusel et al., 2017), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) on the ImageNet and MSCOCO (Lin et al., 2014) validation sets. This indicates the use of standard, predefined splits for these well-known datasets.
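The row above quotes three standard reconstruction metrics (rFID, PSNR, SSIM). As a quick illustration of the simplest of these, a minimal PSNR sketch (not code from the paper; function name and toy data are illustrative):

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a random "clean" image vs. a slightly noisy copy.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + rng.normal(0.0, 0.01, img.shape), 0.0, 1.0)
print(psnr(img, noisy))  # higher is better; typically reported in dB
```

Higher PSNR means a reconstruction closer to the original; it complements SSIM (perceptual structure) and rFID (distributional fidelity).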
Hardware Specification Yes For example, when using 1024 tokens of 512×512 images, the GFLOPs and the inference throughput of SiT-XL are 373.3 and 0.1 images/second on a single A100, respectively. We train the GMM on the entire ImageNet with a batch size of 256 on a single NVIDIA A8000.
Software Dependencies No The paper mentions using the "XQ-GAN codebase (Li et al., 2024d)" and optimizers like "AdamW (Loshchilov, 2017)", but it does not provide specific version numbers for software libraries, programming languages, or other ancillary software dependencies required for replication.
Experiment Setup Yes Table 7. Training configuration of MAETok on 256×256 and 512×512 ImageNet. Table 8. Training configuration of SiT-XL on 256×256 and 512×512 ImageNet. Table 9. Training configuration of LightningDiT on 256×256 and 512×512 ImageNet. These tables provide specific values for image resolution, hidden dimensions, number of heads/layers, patch sizes, optimizers, learning rates, weight decay, batch sizes, learning rate schedules, training steps, augmentation, diffusion samplers, and evaluation metrics.
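The configuration tables quoted above enumerate the usual training knobs. A hypothetical sketch of how such a configuration could be recorded for reproducibility (field names mirror the quoted categories; all values are illustrative placeholders, not the paper's actual settings):

```python
from dataclasses import dataclass, asdict

@dataclass
class TrainConfig:
    # Illustrative placeholders only -- not values from Tables 7-9.
    image_size: int = 256
    hidden_dim: int = 768
    num_heads: int = 12
    num_layers: int = 12
    patch_size: int = 16
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    batch_size: int = 256
    lr_schedule: str = "cosine"
    train_steps: int = 500_000

# Logging the full config alongside checkpoints makes runs reproducible.
cfg = TrainConfig(image_size=512)
print(asdict(cfg))
```

Recording every hyperparameter in one serializable object, as the paper's tables do on paper, is what makes an "Experiment Setup: Yes" verdict checkable.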