EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Authors: Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, Nan Du

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a stateof-the-art Gen Eval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
Researcher Affiliation Collaboration Haotian Sun (1,2), Tao Lei (1), Bowen Zhang (1), Yanghao Li (1), Haoshuo Huang (1), Ruoming Pang (1), Bo Dai (2), Nan Du (1) — (1) Apple AI/ML, (2) Georgia Institute of Technology
Pseudocode Yes Algorithm 1: Pseudocode of EC-DIT's Routing Layer
# B: batch size, S: sequence length, d: hidden dimension
# E: number of experts, C: expert capacity
# experts: list of length E containing expert FFNs
def ec_dit_routing(x_p, W_r, experts):
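The quoted Algorithm 1 is truncated after the function signature. A minimal NumPy sketch of expert-choice routing, as the pseudocode's comments describe it, is below; the paper's notation (x_p, W_r, experts, capacity C) is kept, but the routing body, the residual pass-through for unrouted tokens, and the toy expert FFNs are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def ec_dit_routing(x_p, W_r, experts, capacity):
    """Expert-choice routing sketch.

    x_p: (T, d) flattened tokens; W_r: (d, E) router weights;
    experts: list of E callables mapping (C, d) -> (C, d);
    capacity: C, the number of tokens each expert selects.
    """
    T, d = x_p.shape
    E = W_r.shape[1]

    logits = x_p @ W_r                              # (T, E) routing logits
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over experts

    out = np.array(x_p)                             # assumed residual: tokens no expert picks pass through
    for e in range(E):
        # Expert choice: each expert picks its own top-C tokens by score,
        # so per-expert load is fixed and no balancing loss is needed.
        idx = np.argsort(-probs[:, e])[:capacity]
        gate = probs[idx, e:e + 1]                  # (C, 1) gating weights
        out[idx] += gate * experts[e](x_p[idx])     # gated expert output
    return out
```

Because experts select tokens (rather than tokens selecting experts), compute is allocated adaptively: tokens with high routing scores can be processed by several experts, while low-scoring tokens may be processed by none.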
Open Source Code No To further ensure reproducibility, we plan to release the model weights contingent on the acceptance of this work.
Open Datasets Yes We collect and utilize approximately 1.2 billion text-image pairs from the Internet (McKinzie et al., 2024; Lai et al., 2023). ... To evaluate the image quality of the generated images, we measure zero-shot Fréchet Inception Distance (FID) (Heusel et al., 2017) along with CLIP Score (Hessel et al., 2022) on the MS-COCO 256×256 dataset using 30K samples (Lin et al., 2015). We also provide generated samples from a subset of PartiPrompts (Yu et al., 2022) in Appendix E.
Dataset Splits No The paper mentions using the MS-COCO dataset with 30K samples, and collecting 1.2 billion text-image pairs from the Internet, but it does not specify how these datasets were split into training, validation, and test sets. It also mentions using a masking ratio of 0.5 for input sequence length, which is a data augmentation/preprocessing technique, not a dataset split.
Hardware Specification Yes Model training is conducted on v4 and v5p TPUs with a batch size 4096. ... For EC-DIT-M, although the theoretical overhead is around 3%, the actual overhead is measured at 23%. This difference might be attributed to the varying efficiency in inference-time parallelism: EC-DIT-M uses model parallelism to fit on 8 H100 GPUs, whereas the dense model utilizes FSDP.
Software Dependencies No The paper mentions using a 'T5 tokenizer' and 'RMSProp with momentum optimizer' but does not specify any version numbers for these or other software components or libraries. It also mentions 'CLIP-ViT-bigG', which is a model, not a software dependency with a version.
Experiment Setup Yes Model training is conducted on v4 and v5p TPUs with a batch size 4096. We use the RMSProp with momentum optimizer (Hinton, 2012) with a learning rate of 1e-4 and 20K warmup steps. All models are trained with Distributed Data Parallelism (DDP) or Fully Sharded Data Parallel (FSDP) for 800K steps.
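The setup row states a learning rate of 1e-4 with 20K warmup steps over 800K total steps. A hypothetical sketch of that schedule is below; the paper only says "20K warmup steps", so the linear warmup shape and the constant rate afterward are assumptions.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=20_000):
    """Assumed schedule: linear warmup to base_lr, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```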