LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers
Authors: Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency. |
| Researcher Affiliation | Collaboration | 1 Northeastern University, 2 Adobe Research, 3 University of Pennsylvania, 4 Middle Tennessee State University, 5 Fudan University, 6 University of Georgia |
| Pseudocode | No | The paper describes the methodology in regular paragraph text and equations (e.g., Section 3.3 Lazy Learning) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository for the methodology described. |
| Open Datasets | Yes | We freeze the original model weights and introduce linear layers as lazy learning layers before each MHSA and Feedforward module at every diffusion step. For various sampling steps, these added layers are trained on the ImageNet dataset with 500 steps, with a learning rate of 1e-4 and using the AdamW optimizer. |
| Dataset Splits | No | The paper mentions training on the ImageNet dataset with 500 steps and generating 50,000 images per trial for quantitative analysis, but it does not provide specific details on how the ImageNet dataset was split into training, validation, or test sets for reproducibility. |
| Hardware Specification | Yes | The training is conducted on 8 NVIDIA A100 GPUs within 10 minutes. Results are obtained using a smartphone with a Qualcomm Snapdragon 8 Gen 3, featuring a Qualcomm Kryo octa-core CPU, a Qualcomm Adreno GPU, and 16 GB of unified memory. |
| Software Dependencies | No | The paper mentions using OpenCL for the mobile GPU backend but does not specify its version. It also references 'pytorch-OpCounter' in the bibliography, which is an external tool, not a dependency for their implementation. Key software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA) are not provided. |
| Experiment Setup | Yes | For various sampling steps, these added layers are trained on the ImageNet dataset with 500 steps, with a learning rate of 1e-4 and using the AdamW optimizer. Following the training pipeline in DiT, we randomly drop some labels, assign a null token for classifier-free guidance, and set a global batch size of 256. We regulate the penalty ratios ρattn and ρfeed for MHSA and Feedforward in Eq. (5) from 1e-7 to 1e-2. Table 1: DiT model results on ImageNet (cfg=1.5). |
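The setup quoted above (freeze the original weights, add trainable linear "lazy learning" layers before each MHSA and Feedforward module, train only those layers with AdamW at lr 1e-4) can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: `DummyBlock`, `LazyBlock`, and the layer placement are hypothetical stand-ins for a real DiT block.

```python
import torch
import torch.nn as nn

class DummyBlock(nn.Module):
    """Hypothetical stand-in for a DiT transformer block."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, dim)  # placeholder for MHSA
        self.mlp = nn.Linear(dim, dim)   # placeholder for Feedforward

class LazyBlock(nn.Module):
    """Freeze the wrapped block's weights and prepend a trainable
    linear 'lazy learning' layer before MHSA and Feedforward."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False      # original weights stay frozen
        self.lazy_attn = nn.Linear(dim, dim)
        self.lazy_feed = nn.Linear(dim, dim)

    def forward(self, x):
        x = x + self.block.attn(self.lazy_attn(x))
        x = x + self.block.mlp(self.lazy_feed(x))
        return x

dim = 64
model = LazyBlock(DummyBlock(dim), dim)
# only the added lazy layers are optimized, with the paper's lr of 1e-4
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)
out = model(torch.randn(2, 16, dim))
```

Note that only four parameter tensors (the two lazy layers' weights and biases) receive gradients; the frozen block contributes none, which is why the paper can report training completing within minutes.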