EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
Authors: Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, Nan Du
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability. |
| Researcher Affiliation | Collaboration | Haotian Sun¹٫², Tao Lei¹, Bowen Zhang¹, Yanghao Li¹, Haoshuo Huang¹, Ruoming Pang¹, Bo Dai², Nan Du¹ — ¹Apple AI/ML, ²Georgia Institute of Technology |
| Pseudocode | Yes | Algorithm 1: Pseudocode of EC-DIT's Routing Layer. `# B: batch size, S: sequence length, d: hidden dimension` `# E: number of experts, C: expert capacity` `# experts: list of length E containing expert FFNs` `def ec_dit_routing(x_p, W_r, experts):` |
| Open Source Code | No | To further ensure reproducibility, we plan to release the model weights contingent on the acceptance of this work. |
| Open Datasets | Yes | We collect and utilize approximately 1.2 billion text-image pairs from the Internet (McKinzie et al., 2024; Lai et al., 2023). ... To evaluate the image quality of the generated images, we measure zero-shot Fréchet Inception Distance (FID) (Heusel et al., 2017) along with CLIP Score (Hessel et al., 2022) on the MS-COCO 256×256 dataset using 30K samples (Lin et al., 2015). We also provide generated samples from a subset of PartiPrompts (Yu et al., 2022) in Appendix E. |
| Dataset Splits | No | The paper mentions using the MS-COCO dataset with 30K samples, and collecting 1.2 billion text-image pairs from the Internet, but it does not specify how these datasets were split into training, validation, and test sets. It also mentions using a masking ratio of 0.5 for input sequence length, which is a data augmentation/preprocessing technique, not a dataset split. |
| Hardware Specification | Yes | Model training is conducted on v4 and v5p TPUs with a batch size 4096. ... For EC-DIT-M, although the theoretical overhead is around 3%, the actual overhead is measured at 23%. This difference might be attributed to the varying efficiency in inference-time parallelism: EC-DIT-M uses model parallelism to fit on 8 H100 GPUs, whereas the dense model utilizes FSDP. |
| Software Dependencies | No | The paper mentions using a 'T5 tokenizer' and 'RMSProp with momentum optimizer' but does not specify any version numbers for these or other software components or libraries. It also mentions 'CLIP-ViT-bigG', which is a model, not a software dependency with a version. |
| Experiment Setup | Yes | Model training is conducted on v4 and v5p TPUs with a batch size 4096. We use the RMSProp with momentum optimizer (Hinton, 2012) with a learning rate of 1e-4 and 20K warmup steps. All models are trained with Distributed Data Parallelism (DDP) or Fully Sharded Data Parallel (FSDP) for 800K steps. |
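The routing pseudocode quoted in the Pseudocode row is truncated after the function signature. As context for readers, the following is a minimal runnable sketch of expert-choice routing in the style that signature suggests; the function body, the softmax gating over experts, and the per-expert top-C selection are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def ec_dit_routing(x, W_r, experts, capacity):
    """Expert-choice routing sketch (assumed shapes, not the paper's code).

    x:        (S, d) token representations for one image
    W_r:      (d, E) router projection
    experts:  list of E callables mapping (C, d) -> (C, d)
    capacity: C, the number of tokens each expert selects
    """
    scores = x @ W_r                                   # (S, E) router logits
    # Numerically stable softmax over the expert dimension (assumed gating).
    m = scores.max(axis=1, keepdims=True)
    e_s = np.exp(scores - m)
    probs = e_s / e_s.sum(axis=1, keepdims=True)       # (S, E) gate weights

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        # Expert choice: each expert picks its top-C tokens,
        # rather than each token picking its top-k experts.
        top = np.argsort(-probs[:, e])[:capacity]      # indices of C best tokens
        out[top] += probs[top, e][:, None] * expert(x[top])
    return out
```

Because selection is per expert, load is balanced by construction (every expert processes exactly C tokens), while tokens the router deems important can be picked by several experts and thus receive more compute.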