Dynamic Diffusion Transformer
Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer. |
| Researcher Affiliation | Collaboration | Wangbo Zhao (1), Yizeng Han (2), Jiasheng Tang (2,3), Kai Wang (1), Yibing Song (2,3), Gao Huang (4), Fan Wang (2), Yang You (1); (1) National University of Singapore, (2) DAMO Academy, Alibaba Group, (3) Hupan Lab, (4) Tsinghua University |
| Pseudocode | No | The paper describes the methods using mathematical equations and diagrams, for example, Equations (1), (2), (3), (5), and (6) define operations and losses, and Figure 2 illustrates the architecture. However, there are no explicit blocks labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer. |
| Open Datasets | Yes | We mainly conduct experiments on ImageNet (Deng et al., 2009) at a resolution of 256×256. To comprehensively evaluate our method, we also assess performance and efficiency on four fine-grained datasets used by Xie et al. (2023): Food (Bossard et al., 2014), Artbench (Liao et al., 2022), Cars (Gebru et al., 2017) and Birds (Wah et al., 2011). ... We evaluate the architecture generalization capability of our method through experiments on U-ViT (Bao et al., 2023) ... on the CIFAR-10 dataset (Krizhevsky et al., 2009). ... Our model is initialized using the official PixArt-α checkpoint fine-tuned on the COCO dataset (Lin et al., 2014). |
| Dataset Splits | Yes | Following prior works (Peebles & Xie, 2023; Teng et al., 2024), we sample 50,000 images to measure the Fréchet Inception Distance (FID) (Heusel et al., 2017) score... To evaluate the data efficiency of our method, we randomly sampled 10% of the ImageNet dataset (Deng et al., 2009) for training. |
| Hardware Specification | Yes | All experiments are conducted on a server with 8 NVIDIA A800 80G GPUs. More details of model configurations and training setup can be found in Appendix A.1 and A.2, respectively. Following DiT (Peebles & Xie, 2023), the strength of classifier-free guidance (Ho & Salimans, 2022) is set to 1.5 and 4.0 for evaluation and visualization, respectively. Unless otherwise specified, 250 DDPM (Ho et al., 2020) sampling steps are used. All speed tests are performed on an NVIDIA V100 32G GPU. |
| Software Dependencies | No | The training details list: optimizer AdamW (Loshchilov, 2017); learning rate 1e-4; global batch size 256; target FLOPs ratio λ of [0.9, 0.8, 0.7, 0.5, 0.4, 0.3] for DiT-S and DiT-B and [0.7, 0.6, 0.5, 0.3] for DiT-XL; fine-tuning iterations of 50,000 (DiT-S), 100,000 (DiT-B), and 150,000 for λ = 0.7 or 200,000 otherwise (DiT-XL); warmup iterations of 0 (DiT-S/B) and 30,000 (DiT-XL); random-flip augmentation; cropping size 224×224. While AdamW is mentioned as an optimizer with a citation, explicit version numbers for general software dependencies such as Python, PyTorch, or CUDA are not provided. |
| Experiment Setup | Yes | Implementation details. Our DyDiT can be built easily by fine-tuning on pre-trained DiT weights. We experiment on three different-sized DiT models denoted as DiT-S/B/XL. For DiT-XL, we directly adopt the checkpoint from the official DiT repository (Peebles & Xie, 2023), while for DiT-S and DiT-B, we use pre-trained models provided in Pan et al. (2024). All experiments are conducted on a server with 8 NVIDIA A800 80G GPUs. More details of model configurations and training setup can be found in Appendix A.1 and A.2, respectively. Following DiT (Peebles & Xie, 2023), the strength of classifier-free guidance (Ho & Salimans, 2022) is set to 1.5 and 4.0 for evaluation and visualization, respectively. Unless otherwise specified, 250 DDPM (Ho et al., 2020) sampling steps are used. All speed tests are performed on an NVIDIA V100 32G GPU. ... In Table 6, we present the training details of our model on ImageNet. For DiT-XL, which is pretrained over 7,000,000 iterations, only 200,000 additional fine-tuning iterations (around 3%) are needed to enable the dynamic architecture (λ = 0.5) with our method. For a higher target FLOPs ratio λ = 0.7, the iterations can be further reduced. Table 6 settings: optimizer AdamW (Loshchilov, 2017); learning rate 1e-4; global batch size 256; target FLOPs ratio λ of [0.9, 0.8, 0.7, 0.5, 0.4, 0.3] for DiT-S and DiT-B and [0.7, 0.6, 0.5, 0.3] for DiT-XL; fine-tuning iterations of 50,000 (DiT-S), 100,000 (DiT-B), and 150,000 for λ = 0.7 or 200,000 otherwise (DiT-XL); warmup iterations of 0 (DiT-S/B) and 30,000 (DiT-XL); random-flip augmentation; cropping size 224×224. |
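The fine-tuning recipe quoted in the table (Table 6 of the paper) can be collected into a small Python sketch. This is an illustrative summary only; the names `FINETUNE_COMMON`, `PER_MODEL`, and `finetune_iterations` are hypothetical helpers, not part of the released code.

```python
# Hedged sketch of the Table 6 fine-tuning settings reported for DyDiT.
# All values are transcribed from the paper's quoted setup; the structure
# and names here are illustrative, not from the official repository.

FINETUNE_COMMON = {
    "optimizer": "AdamW",         # Loshchilov, 2017
    "learning_rate": 1e-4,
    "global_batch_size": 256,
    "augmentation": "random flip",
    "crop_size": 224,             # 224x224 cropping
}

PER_MODEL = {
    "DiT-S":  {"flops_ratios": [0.9, 0.8, 0.7, 0.5, 0.4, 0.3], "warmup_iters": 0},
    "DiT-B":  {"flops_ratios": [0.9, 0.8, 0.7, 0.5, 0.4, 0.3], "warmup_iters": 0},
    "DiT-XL": {"flops_ratios": [0.7, 0.6, 0.5, 0.3], "warmup_iters": 30_000},
}

def finetune_iterations(model: str, flops_ratio: float) -> int:
    """Fine-tuning iterations per Table 6: 50k for DiT-S, 100k for DiT-B;
    DiT-XL uses 150k when the target FLOPs ratio is 0.7 and 200k otherwise."""
    if model == "DiT-S":
        return 50_000
    if model == "DiT-B":
        return 100_000
    if model == "DiT-XL":
        return 150_000 if flops_ratio == 0.7 else 200_000
    raise ValueError(f"unknown model: {model}")
```

For example, `finetune_iterations("DiT-XL", 0.5)` gives the 200,000 iterations that correspond to the "around 3% of pretraining" figure the paper cites for DiT-XL.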
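The setup rows mention classifier-free guidance strengths of 1.5 (evaluation) and 4.0 (visualization). For context, a minimal sketch of the standard classifier-free guidance combination (Ho & Salimans, 2022) that these strengths parameterize; the function and variable names are hypothetical, and real implementations operate on tensors rather than lists.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps = eps_u + s * (eps_c - eps_u).
    scale=1.0 recovers the plain conditional prediction; the paper uses
    scale=1.5 for evaluation and scale=4.0 for visualization."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Higher scales push the denoising direction further toward the class-conditional prediction, which tends to sharpen samples at the cost of diversity.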