AdaDiff: Adaptive Step Selection for Fast Diffusion Models

Authors: Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline using a fixed 50 denoising steps while reducing inference time by at least 33%, going as high as 40%.
Researcher Affiliation | Collaboration | (1) Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; (2) Shanghai Collaborative Innovation Center of Intelligent Visual Computing; (3) ByteDance Inc.
Pseudocode | No | The paper describes the methodology using natural language and mathematical equations (e.g., Eq. 6 and 7 for gradient calculation) but does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper contains no explicit statement about releasing source code for AdaDiff, provides no link to a code repository, and does not mention code in supplementary materials.
Open Datasets | Yes | To evaluate the effectiveness and generalizability of our approach, we conduct extensive experiments on three image datasets: MS COCO 2017 (Lin et al. 2014), LAION-COCO (Schuhmann et al. 2022), and DiffusionDB (Wang et al. 2022), and two video datasets: MSR-VTT (Xu et al. 2016) and InternVid (Wang et al. 2023b).
Dataset Splits | Yes | In MS COCO 2017, our training set consists of 118,287 textual descriptions, and all 25,014 text-image pairs from the validation set are employed for testing. Regarding LAION-COCO, we randomly select 200K textual descriptions for training and 20K text-image pairs for testing. The partitioning of the training and testing sets for DiffusionDB follows the same paradigm as LAION-COCO. The training sets for MSR-VTT and InternVid consist of 6,651 and 24,911 text descriptions, respectively. The test set for MSR-VTT comprises 2,870 text-video pairs.
Hardware Specification | Yes | The training cost ranges from 16 to 80 A100 GPU hours for the adaptive step policy applied to different base models.
Software Dependencies | Yes | For image generation, we use SD-v2.1-base and SDXL-Turbo to generate 512×512 images, and SDXL-v1.0 for 1024×1024 images. For video generation, we use ModelScopeT2V to generate 16-frame videos at a resolution of 256×256. ... We train the step selection network ... use the Adam optimizer with an initial learning rate of 10⁻⁵.
Experiment Setup | Yes | We design N distinct schedulers for the DDIM sampler, each corresponding to a specific total number of sampling steps. In this paper, unless specified otherwise, we set N = 10, which corresponds to the set of step values S = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}. ... We train the step selection network for 200 epochs with a batch size of 256 and use the Adam optimizer with an initial learning rate of 10⁻⁵. ... λ is a hyperparameter that controls the effect of the image quality reward Q(u), and γ is the penalty imposed on the reward function when the generated image quality is low. ... we empirically set k to 3.
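The setup above can be sketched in code. The exact reward formula is not given in this excerpt, so the function below is an assumed shape only: a step-saving term plus a quality term weighted by λ, with a penalty γ subtracted when quality falls below a threshold. The names `reward`, `lam`, `gamma`, and `q_min` are illustrative, not from the paper.

```python
# Candidate DDIM step counts: N = 10 schedulers, as stated in the paper.
CANDIDATE_STEPS = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

def reward(q, steps, lam=1.0, gamma=1.0, q_min=0.5, max_steps=50):
    """Illustrative reward for the step-selection policy (assumed form).

    q        -- image quality score Q(u) in [0, 1]
    steps    -- number of denoising steps chosen from CANDIDATE_STEPS
    lam      -- weight on the quality term (lambda in the paper)
    gamma    -- penalty applied when quality is low (gamma in the paper)
    q_min    -- assumed quality threshold triggering the penalty
    """
    step_saving = 1.0 - steps / max_steps  # reward fewer denoising steps
    r = step_saving + lam * q              # trade speed against quality
    if q < q_min:                          # low quality incurs the penalty
        r -= gamma
    return r
```

Under this assumed form, equal-quality generations with fewer steps score higher, while a low-quality generation is penalized regardless of how few steps it used, which is the trade-off the paper's λ and γ hyperparameters control.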