Diff-MoE: Diffusion Transformer with Time-Aware and Space-Adaptive Experts
Authors: Kun Cheng, Xiao He, Lei Yu, Zhijun Tu, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work demonstrates the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling. The paper includes performance tables (e.g., Tables 2, 3, 4, 5, 6) and figures (e.g., Figures 1, 4, 5, 6, 7, 8) reporting metrics such as FID and IS, and features an 'Ablation Study' section. |
| Researcher Affiliation | Collaboration | 1. State Key Laboratory of Integrated Services Networks, Xidian University; 2. Huawei Noah's Ark Lab. |
| Pseudocode | No | The paper describes the methodology and architecture using text and diagrams (e.g., Figure 3), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/kunncheng/Diff-MoE. |
| Open Datasets | Yes | We conduct experiments on class-conditional generation tasks using the ImageNet dataset (Deng et al., 2009). |
| Dataset Splits | No | We conduct experiments on class-conditional generation tasks using the ImageNet dataset (Deng et al., 2009), which contains 1,281,167 training images across 1,000 distinct classes. The paper states the size of the ImageNet training set but does not explicitly describe how the data was split into training, validation, and test sets for the experiments. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for conducting the experiments, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using a "pre-trained variational autoencoder (VAE) model from Stable Diffusion (Rombach et al., 2022)" and the "AdamW optimizer", but it does not specify version numbers for any software libraries, frameworks, or operating systems used in the implementation. |
| Experiment Setup | Yes | We train all sizes of Diff-MoE for 400k iterations using the AdamW optimizer with a learning rate of 1e-4. All models are trained with a batch size of 256. Following prior work (Park et al., 2023; Peebles & Xie, 2023), we apply exponential moving average (EMA) to the model parameters during training, with a decay factor of 0.9999, to enhance stability. Rectified flow and expert load balance loss are used by default, with further details provided in the supplementary material. |
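The EMA scheme reported in the setup row (decay factor 0.9999) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dict-of-scalars parameter representation, and the toy values are all assumptions made for clarity.

```python
# Sketch of the exponential moving average (EMA) over model parameters,
# using the decay factor 0.9999 reported in the paper. In a real training
# loop this would run over framework tensors after each optimizer step;
# plain Python floats are used here to keep the example self-contained.

EMA_DECAY = 0.9999  # decay factor from the paper's training setup


def ema_update(ema_params, model_params, decay=EMA_DECAY):
    """Blend the current model weights into the EMA copy, in place."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy usage with two scalar "parameters".
ema = {"w": 1.0, "b": 0.0}
model = {"w": 2.0, "b": 1.0}
ema_update(ema, model)
```

With a decay this close to 1, the EMA weights drift only slowly toward the live weights, which is what makes the averaged model more stable for evaluation.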