AdaDiff: Adaptive Step Selection for Fast Diffusion Models
Authors: Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline using a fixed 50 denoising steps while reducing inference time by at least 33%, going as high as 40%. |
| Researcher Affiliation | Collaboration | 1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing 3Byte Dance Inc. |
| Pseudocode | No | The paper describes the methodology using natural language and mathematical equations (e.g., Eq. 6 and 7 for gradient calculation) but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for AdaDiff, nor does it provide a link to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | To evaluate the effectiveness and generalizability of our approach, we conduct extensive experiments on three image datasets: MS COCO 2017 (Lin et al. 2014), Laion-COCO (Schuhmann et al. 2022), DiffusionDB (Wang et al. 2022), and two video datasets: MSR-VTT (Xu et al. 2016) and InternVid (Wang et al. 2023b). |
| Dataset Splits | Yes | In MS COCO 2017, our training set consists of 118,287 textual descriptions, and all 25,014 text-image pairs from the validation set are employed for testing. Regarding Laion-COCO, we randomly select 200K textual descriptions for training and 20K text-image pairs for testing. The partitioning of the training and testing sets for DiffusionDB follows the same paradigm as Laion-COCO. The training sets for MSR-VTT and InternVid consist of 6,651 and 24,911 text descriptions, respectively. The test set for MSR-VTT comprises 2,870 text-video pairs. |
| Hardware Specification | Yes | The training cost ranges from 16 to 80 A100 GPU hours for the adaptive step policy applied to different base models. |
| Software Dependencies | Yes | For image generation, we use SD-v2.1-base and SDXL-Turbo to generate 512×512 images, and SDXL-v1.0 for 1024×1024 images. For video generation, we use ModelScopeT2V to generate 16-frame videos at a resolution of 256×256. ... We train the step selection network ... use the Adam optimizer with an initial learning rate of 10⁻⁵. |
| Experiment Setup | Yes | We design N distinct schedulers for the DDIM sampler, each corresponding to a specific total number of sampling steps. In this paper, unless specified otherwise, we set N = 10, which corresponds to the set of step values S = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}. ... We train the step selection network for 200 epochs with a batch size of 256 and use the Adam optimizer with an initial learning rate of 10⁻⁵. ... λ is a hyperparameter that controls the effect of image quality reward Q(u) and γ is the penalty imposed on the reward function when the generated image quality is low. ... we empirically set k to 3. |
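The experiment-setup row above can be sketched as a minimal Python snippet. This is a hypothetical illustration based only on the quoted setup (N = 10 candidate schedulers with step counts S = {5, ..., 50}); the paper does not specify its timestep-spacing rule, so evenly spaced DDIM timesteps are assumed here, and `ddim_timesteps` is an illustrative helper, not the authors' code.

```python
# Candidate step counts quoted in the paper: S = {5, 10, ..., 50}.
N = 10
step_values = [5 * (i + 1) for i in range(N)]

def ddim_timesteps(num_steps, num_train_timesteps=1000):
    """Evenly spaced descending DDIM timesteps (a common convention;
    the paper's exact spacing rule is not stated)."""
    stride = num_train_timesteps // num_steps
    return list(range(num_train_timesteps - 1, -1, -stride))[:num_steps]

# One scheduler per candidate step count; the step selection network
# would pick one of these per prompt at inference time.
schedulers = {s: ddim_timesteps(s) for s in step_values}
```

Under this sketch, `schedulers[5]` holds 5 timesteps and `schedulers[50]` holds 50, matching the smallest and largest candidates in S.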