Dynamic Diffusion Transformer
Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer. |
| Researcher Affiliation | Collaboration | Wangbo Zhao (1), Yizeng Han (2), Jiasheng Tang (2,3), Kai Wang (1), Yibing Song (2,3), Gao Huang (4), Fan Wang (2), Yang You (1); (1) National University of Singapore, (2) DAMO Academy, Alibaba Group, (3) Hupan Lab, (4) Tsinghua University |
| Pseudocode | No | The paper describes the methods using mathematical equations and diagrams, for example, Equations (1), (2), (3), (5), and (6) define operations and losses, and Figure 2 illustrates the architecture. However, there are no explicit blocks labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer. |
| Open Datasets | Yes | We mainly conduct experiments on ImageNet (Deng et al., 2009) at a resolution of 256×256. To comprehensively evaluate our method, we also assess performance and efficiency on four fine-grained datasets used by Xie et al. (2023): Food (Bossard et al., 2014), Artbench (Liao et al., 2022), Cars (Gebru et al., 2017) and Birds (Wah et al., 2011). ... We evaluate the architecture generalization capability of our method through experiments on U-ViT (Bao et al., 2023) ... on the CIFAR-10 dataset (Krizhevsky et al., 2009). ... Our model is initialized using the official PixArt-α checkpoint fine-tuned on the COCO dataset (Lin et al., 2014). |
| Dataset Splits | Yes | Following prior works (Peebles & Xie, 2023; Teng et al., 2024), we sample 50,000 images to measure the Fréchet Inception Distance (FID) (Heusel et al., 2017) score... To evaluate the data efficiency of our method, we randomly sampled 10% of the ImageNet dataset (Deng et al., 2009) for training. |
| Hardware Specification | Yes | All experiments are conducted on a server with 8 NVIDIA A800 80G GPUs. More details of model configurations and training setup can be found in Appendix A.1 and A.2, respectively. Following DiT (Peebles & Xie, 2023), the strength of classifier-free guidance (Ho & Salimans, 2022) is set to 1.5 and 4.0 for evaluation and visualization, respectively. Unless otherwise specified, 250 DDPM (Ho et al., 2020) sampling steps are used. All speed tests are performed on an NVIDIA V100 32G GPU. |
| Software Dependencies | No | The training details list: optimizer AdamW (Loshchilov, 2017); learning rate 1e-4; global batch size 256; target FLOPs ratio λ of [0.9, 0.8, 0.7, 0.5, 0.4, 0.3] for DiT-S and DiT-B and [0.7, 0.6, 0.5, 0.3] for DiT-XL; fine-tuning iterations of 50,000 (DiT-S), 100,000 (DiT-B), and 150,000 for λ = 0.7 or 200,000 otherwise (DiT-XL); warmup iterations of 0 (DiT-S/B) and 30,000 (DiT-XL); random-flip augmentation; cropping size 224×224. While AdamW is mentioned as an optimizer with a citation, explicit version numbers for general software dependencies such as Python, PyTorch, or CUDA are not provided. |
| Experiment Setup | Yes | Implementation details. Our DyDiT can be built easily by fine-tuning on pre-trained DiT weights. We experiment on three different-sized DiT models denoted as DiT-S/B/XL. For DiT-XL, we directly adopt the checkpoint from the official DiT repository (Peebles & Xie, 2023), while for DiT-S and DiT-B, we use pre-trained models provided in Pan et al. (2024). All experiments are conducted on a server with 8 NVIDIA A800 80G GPUs. More details of model configurations and training setup can be found in Appendix A.1 and A.2, respectively. Following DiT (Peebles & Xie, 2023), the strength of classifier-free guidance (Ho & Salimans, 2022) is set to 1.5 and 4.0 for evaluation and visualization, respectively. Unless otherwise specified, 250 DDPM (Ho et al., 2020) sampling steps are used. All speed tests are performed on an NVIDIA V100 32G GPU. ... In Table 6, we present the training details of our model on ImageNet. For DiT-XL, which is pretrained over 7,000,000 iterations, only 200,000 additional fine-tuning iterations (around 3%) are needed to enable the dynamic architecture (λ = 0.5) with our method. For a higher target FLOPs ratio λ = 0.7, the iterations can be further reduced. Table 6 settings: optimizer AdamW (Loshchilov, 2017); learning rate 1e-4; global batch size 256; target FLOPs ratio λ of [0.9, 0.8, 0.7, 0.5, 0.4, 0.3] for DiT-S and DiT-B and [0.7, 0.6, 0.5, 0.3] for DiT-XL; fine-tuning iterations of 50,000 (DiT-S), 100,000 (DiT-B), and 150,000 for λ = 0.7 or 200,000 otherwise (DiT-XL); warmup iterations of 0 (DiT-S/B) and 30,000 (DiT-XL); random-flip augmentation; cropping size 224×224. |
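The fine-tuning recipe quoted in the table (Table 6 of the paper) can be collected into a small Python sketch. This is an illustrative summary only; the names `FINETUNE_COMMON`, `PER_MODEL`, and `finetune_iterations` are hypothetical helpers, not part of the released code.

```python
# Hedged sketch of the Table 6 fine-tuning settings reported for DyDiT.
# All values are transcribed from the paper's quoted setup; the structure
# and names here are illustrative, not from the official repository.

FINETUNE_COMMON = {
    "optimizer": "AdamW",         # Loshchilov, 2017
    "learning_rate": 1e-4,
    "global_batch_size": 256,
    "augmentation": "random flip",
    "crop_size": 224,             # 224x224 cropping
}

PER_MODEL = {
    "DiT-S":  {"flops_ratios": [0.9, 0.8, 0.7, 0.5, 0.4, 0.3], "warmup_iters": 0},
    "DiT-B":  {"flops_ratios": [0.9, 0.8, 0.7, 0.5, 0.4, 0.3], "warmup_iters": 0},
    "DiT-XL": {"flops_ratios": [0.7, 0.6, 0.5, 0.3], "warmup_iters": 30_000},
}

def finetune_iterations(model: str, flops_ratio: float) -> int:
    """Fine-tuning iterations per Table 6: 50k for DiT-S, 100k for DiT-B;
    DiT-XL uses 150k when the target FLOPs ratio is 0.7 and 200k otherwise."""
    if model == "DiT-S":
        return 50_000
    if model == "DiT-B":
        return 100_000
    if model == "DiT-XL":
        return 150_000 if flops_ratio == 0.7 else 200_000
    raise ValueError(f"unknown model: {model}")
```

For example, `finetune_iterations("DiT-XL", 0.5)` gives the 200,000 iterations that correspond to the "around 3% of pretraining" figure the paper cites for DiT-XL.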
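The setup rows mention classifier-free guidance strengths of 1.5 (evaluation) and 4.0 (visualization). For context, a minimal sketch of the standard classifier-free guidance combination (Ho & Salimans, 2022) that these strengths parameterize; the function and variable names are hypothetical, and real implementations operate on tensors rather than lists.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps = eps_u + s * (eps_c - eps_u).
    scale=1.0 recovers the plain conditional prediction; the paper uses
    scale=1.5 for evaluation and scale=4.0 for visualization."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Higher scales push the denoising direction further toward the class-conditional prediction, which tends to sharpen samples at the cost of diversity.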