Fast Training of Diffusion Models with Masked Transformers

Authors: Hongkai Zheng, Weili Nie, Arash Vahdat, Anima Anandkumar

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on ImageNet 256×256 and ImageNet 512×512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance."
Researcher Affiliation | Collaboration | Hongkai Zheng (Caltech), Weili Nie (NVIDIA), Arash Vahdat (NVIDIA), Anima Anandkumar (Caltech)
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulas (e.g., L_DSM, L_MAE, and the combined loss L), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/Anima-Lab/MaskDiT."
Open Datasets | Yes | "Experiments on ImageNet 256×256 and ImageNet 512×512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model."
Dataset Splits | No | The paper mentions using the ImageNet 256×256 and ImageNet 512×512 datasets but does not explicitly provide details on training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits beyond the dataset names themselves).
Hardware Specification | Yes | "Unless otherwise noted, experiments on ImageNet 256×256 are conducted on 8 A100 GPUs, each with 80GB memory, whereas for ImageNet 512×512, we use 32 A100 GPUs."
Software Dependencies | No | The paper mentions using the "pre-trained VAE model from Stable Diffusion (Rombach et al., 2022)" and "ADM's TensorFlow evaluation suite (Dhariwal & Nichol, 2021)" but does not provide version numbers for these components or for libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "Most training details are kept the same as in the DiT work: AdamW (Loshchilov & Hutter, 2017) with a constant learning rate of 1e-4, no weight decay, and an exponential moving average (EMA) of model weights over training with a decay of 0.9999. Also, we use the same initialization strategies as DiT. By default, we use a masking ratio of 50%, an MAE coefficient λ = 0.1, a probability of dropping class labels p_uncond = 0.1, and a batch size of 1024. For the unmasked tuning, we change the learning rate to 5e-5 and use full precision for better training stability."
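The paper's training objective, as referenced in the Pseudocode row, combines a denoising score matching term with an MAE-style reconstruction term weighted by λ = 0.1 (L = L_DSM + λ·L_MAE). A minimal sketch of that combination is below; the tensor names, shapes, and the use of plain mean-squared errors are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def combined_loss(pred_noise, true_noise, pred_patches, true_patches, lam=0.1):
    """Sketch of the combined objective L = L_DSM + lam * L_MAE.

    pred_noise / true_noise: score (noise) prediction on unmasked patches.
    pred_patches / true_patches: MAE-style reconstruction of masked patches.
    All argument names and loss forms here are assumptions for illustration.
    """
    l_dsm = torch.mean((pred_noise - true_noise) ** 2)      # denoising score matching
    l_mae = torch.mean((pred_patches - true_patches) ** 2)  # masked-patch reconstruction
    return l_dsm + lam * l_mae
```

With λ = 0.1 as in the paper, the MAE term acts as a light auxiliary signal on the masked patches rather than a co-equal objective.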
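The hyperparameters quoted in the Experiment Setup row can be collected into a short PyTorch configuration sketch. The `model` below is a placeholder module (the actual MaskDiT architecture lives in the authors' repository), and the EMA helper is a standard formulation assumed here, not code from the paper.

```python
import copy
import torch

# Hyperparameters reported in the paper (following DiT):
LR = 1e-4          # constant learning rate for main training
LR_TUNE = 5e-5     # reduced learning rate for the unmasked-tuning phase
EMA_DECAY = 0.9999
MASK_RATIO = 0.5   # fraction of patches masked during training
LAMBDA_MAE = 0.1   # weight on the MAE reconstruction term
P_UNCOND = 0.1     # probability of dropping class labels
BATCH_SIZE = 1024

# Placeholder for the MaskDiT network; any nn.Module works for this sketch.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=0.0)

# EMA copy of the weights, updated after every optimizer step.
ema_model = copy.deepcopy(model)

@torch.no_grad()
def update_ema(ema, online, decay=EMA_DECAY):
    """Standard EMA update: ema <- decay * ema + (1 - decay) * online."""
    for p_ema, p in zip(ema.parameters(), online.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```

For the unmasked-tuning phase described in the paper, one would rebuild the optimizer with `lr=LR_TUNE` and run in full (fp32) precision.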