Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. Extensive experiments on ImageNet (Deng et al., 2009) demonstrate the effectiveness of MAETok.
Researcher Affiliation Collaboration 1Carnegie Mellon University, 2AMD, 3The University of Hong Kong, 4Peking University, 5William & Mary. Correspondence to: Hao Chen <EMAIL>.
Pseudocode No The paper describes the model architecture and training objectives using mathematical equations and descriptive text, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code and trained models are released: https://github.com/Hhhhhhao/continuous_tokenizer.
Open Datasets Yes Extensive experiments on ImageNet (Deng et al., 2009) demonstrate the effectiveness of MAETok... Three MAETok variants are trained on 256×256 ImageNet (Deng et al., 2009), 512×512 ImageNet, and a subset of 512×512 LAION-COCO (Schuhmann et al., 2022) for 500K iterations, respectively.
Dataset Splits Yes We report the reconstruction Fréchet inception distance (rFID) (Heusel et al., 2017), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) on the ImageNet and MSCOCO (Lin et al., 2014) validation sets. This indicates the use of standard, predefined splits for these well-known datasets.
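The row above quotes three standard reconstruction metrics (rFID, PSNR, SSIM). As a quick illustration of the simplest of these, a minimal PSNR sketch (not code from the paper; function name and toy data are illustrative):

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a random "clean" image vs. a slightly noisy copy.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + rng.normal(0.0, 0.01, img.shape), 0.0, 1.0)
print(psnr(img, noisy))  # higher is better; typically reported in dB
```

Higher PSNR means a reconstruction closer to the original; it complements SSIM (perceptual structure) and rFID (distributional fidelity).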
Hardware Specification Yes For example, when using 1024 tokens of 512×512 images, the GFLOPs and the inference throughput of SiT-XL are 373.3 and 0.1 images/second on a single A100, respectively. We train the GMM on the entire ImageNet with a batch size of 256 on a single NVIDIA A8000.
Software Dependencies No The paper mentions using the "XQ-GAN codebase (Li et al., 2024d)" and optimizers like "AdamW (Loshchilov, 2017)", but it does not provide specific version numbers for software libraries, programming languages, or other ancillary software dependencies required for replication.
Experiment Setup Yes Table 7. Training configuration of MAETok on 256×256 and 512×512 ImageNet. Table 8. Training configuration of SiT-XL on 256×256 and 512×512 ImageNet. Table 9. Training configuration of LightningDiT on 256×256 and 512×512 ImageNet. These tables provide specific values for image resolution, hidden dimensions, number of heads/layers, patch sizes, optimizers, learning rates, weight decay, batch sizes, learning rate schedules, training steps, augmentation, diffusion samplers, and evaluation metrics.
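The configuration tables quoted above enumerate the usual training knobs. A hypothetical sketch of how such a configuration could be recorded for reproducibility (field names mirror the quoted categories; all values are illustrative placeholders, not the paper's actual settings):

```python
from dataclasses import dataclass, asdict

@dataclass
class TrainConfig:
    # Illustrative placeholders only -- not values from Tables 7-9.
    image_size: int = 256
    hidden_dim: int = 768
    num_heads: int = 12
    num_layers: int = 12
    patch_size: int = 16
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    batch_size: int = 256
    lr_schedule: str = "cosine"
    train_steps: int = 500_000

# Logging the full config alongside checkpoints makes runs reproducible.
cfg = TrainConfig(image_size=512)
print(asdict(cfg))
```

Recording every hyperparameter in one serializable object, as the paper's tables do on paper, is what makes an "Experiment Setup: Yes" verdict checkable.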