Distribution Backtracking Builds A Faster Convergence Trajectory for Diffusion Distillation

Authors: Shengyuan Zhang, Ling Yang, Zejian Li, An Zhao, Chenye Meng, Changyuan Yang, Guang Yang, Zhiyuan Yang, Lingyun Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Extensive experiments show that DisBack achieves faster and better convergence than existing distillation methods and achieves comparable or better generation performance, with an FID score of 1.38 on the ImageNet 64×64 dataset. Experiments are conducted on different models across various datasets. We first compare the performance of DisBack with other multi-step diffusion models and distillation methods (Sec. 5.1). Secondly, we compare the convergence speed of DisBack with its variants without the constraint of the convergence trajectory (Sec. 5.2). Thirdly, further experiments are conducted to demonstrate DisBack's effectiveness in mitigating the score mismatch issue (Sec. 5.3). Finally, we conduct an ablation study to show the effectiveness of introducing the convergence trajectory (Sec. 5.4).
Researcher Affiliation Collaboration Shengyuan Zhang1, Ling Yang3, Zejian Li2, An Zhao1, Chenye Meng1, Changyuan Yang4, Guang Yang4, Zhiyuan Yang4, Lingyun Sun1 1College of Computer Science and Technology, Zhejiang University 2School of Software Technology, Zhejiang University 3Peking University 4Alibaba Group
Pseudocode Yes Algorithm 1 Degradation Recording. Input: initial student generator G0_stu and pre-trained diffusion model s_θ. Output: degradation path checkpoints {s_θi | i = 0, ..., N}. Initialize s_θ' ← s_θ. While not converged: sample x0 = G0_stu(z; η); update θ' with the denoising score-matching gradient ∇_θ' E_{t,ϵ} ‖s_θ'(x_t, t) − ∇_{x_t} log p(x_t | x0)‖²; save intermediate checkpoints s_θi. End while.
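The degradation-recording loop above can be sketched in a toy form. Everything here is illustrative and not from the paper: the 1-D linear "score model" s(x, t) = w·x, the Gaussian student generator, and all constants (N_STEPS, K_SAVE, LR, sigma) are assumptions chosen only to make the checkpoint-saving pattern concrete.

```python
import numpy as np

# Toy sketch of the Degradation Recording stage (Algorithm 1).
# The "score model" is a single weight w; the frozen initial student
# generator G0 emits samples from a fixed Gaussian. All names and
# constants here are illustrative, not the paper's implementation.
rng = np.random.default_rng(0)

def g0_student(n):
    # Frozen initial student generator: a poor approximation of the data.
    return rng.normal(loc=3.0, scale=0.5, size=n)

def dsm_grad(w, x0, sigma=1.0):
    # Denoising score matching on student samples:
    # minimize E || s(x_t) - (x0 - x_t)/sigma^2 ||^2, with x_t = x0 + sigma*eps.
    eps = rng.normal(size=x0.shape)
    xt = x0 + sigma * eps
    target = (x0 - xt) / sigma**2
    residual = w * xt - target
    return np.mean(2.0 * residual * xt)  # d/dw of the squared loss

w = -1.0                  # stand-in for the pre-trained score weights theta
checkpoints = [w]         # s_{theta_0} is the teacher itself
N_STEPS, K_SAVE, LR = 200, 50, 1e-2
for step in range(1, N_STEPS + 1):
    x0 = g0_student(256)
    w -= LR * dsm_grad(w, x0)
    if step % K_SAVE == 0:
        checkpoints.append(w)  # intermediate node s_{theta_i} on the path

print(len(checkpoints))  # teacher plus one checkpoint every K_SAVE steps
```

Saving every 50 of 200 steps yields 5 nodes in total, mirroring the 5 intermediate nodes {s_θi | i = 0, ..., 4} reported in the experiment setup row.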
Open Source Code Yes Our code is publicly available at https://github.com/SYZhang0805/DisBack.
Open Datasets Yes The FFHQ (Flickr-Faces-HQ) dataset (Karras et al., 2019) is a high-resolution dataset of human face images used for face generation tasks. The AFHQv2 (Animal Faces-HQ) dataset (Choi et al., 2020) comprises 15,000 high-definition animal face images at a resolution of 512×512. The ImageNet dataset (Deng et al., 2009) was established as a large-scale image dataset to facilitate the development of computer vision technologies. The LSUN (Large-Scale Scene Understanding) dataset (Yu et al., 2015) is a large-scale dataset for scene understanding in visual deep-learning tasks.
Dataset Splits Yes In this paper, we use the ImageNet64 dataset, a subsampled version of the ImageNet dataset. The ImageNet64 dataset consists of a vast collection of images at a resolution of 64×64, containing 1,281,167 training samples, 50,000 testing samples, and 1,000 labels.
Hardware Specification Yes The training consisted of 50,000 iterations on four NVIDIA 3090 GPUs, with the batch size per GPU set to 8. A separate training run consisted of 10,000 iterations on one NVIDIA A100 GPU, with the batch size per GPU set to 2.
Software Dependencies No The paper mentions optimizers (Adam, SGD, AdamW) and models (U-Nets, ResNet MLP, Stable Diffusion, LCM-LoRA) but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes For experiments on the FFHQ 64×64, AFHQv2 64×64, and ImageNet 64×64 datasets, the pre-trained models are provided by the official release of EDM (Karras et al., 2022). We use Adam optimizers to train the student generator G and sϕ, with both learning rates set to 1e-5. The training consisted of 50,000 iterations on four NVIDIA 3090 GPUs, with the batch size per GPU set to 8. The training ratio between sϕ and G remains 1:1. In the Degradation stage, we trained for 200 epochs in total, saving a checkpoint every 50 epochs, yielding 5 intermediate nodes along the degradation path {s_θi | i = 0, 1, 2, 3, 4}. In the Distribution Backtracking stage, when i ≥ 3, each checkpoint was trained for 1,000 steps; when i < 3, each checkpoint was trained for 10,000 steps. The remaining steps were used to distill the original teacher model s_θ0.
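A rough sketch of the step budget this row implies, assuming five checkpoints backtracked from the most degraded node (i = 4) toward the teacher (i = 0) within the 50,000-iteration total; the exact backtracking order and the i ≥ 3 threshold are read from the row above, and the variable names are illustrative:

```python
# Illustrative per-checkpoint step budget for the Distribution
# Backtracking stage. Assumes 5 path checkpoints backtracked from
# i = 4 down to i = 0, with the thresholds stated in the setup row.
TOTAL_STEPS = 50_000

def steps_for(i):
    # Checkpoints near the degraded endpoint (i >= 3) get 1,000 steps;
    # checkpoints closer to the teacher (i < 3) get 10,000 steps.
    return 1_000 if i >= 3 else 10_000

schedule = [(i, steps_for(i)) for i in range(4, -1, -1)]
used = sum(steps for _, steps in schedule)
remaining = TOTAL_STEPS - used  # leftover budget distills the teacher s_theta_0
print(schedule, remaining)
```

Under these assumptions the five checkpoints consume 32,000 of the 50,000 iterations, leaving 18,000 for distilling the original teacher.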