Efficient Distillation of Classifier-Free Guidance using Adapters

Authors: Cristian Perez Jensen, Seyedmorteza Sadat

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models (≈2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method. Setup: We evaluate AGD on class-conditional generation using 256 × 256 DiT-XL/2 (Peebles & Xie, 2023), and text-to-image generation using 768 × 768 Stable Diffusion 2.1 (SD2.1) (Rombach et al., 2022) and 1024 × 1024 Stable Diffusion XL (SDXL) (Podell et al., 2024).
Researcher Affiliation | Academia | Cristian Perez Jensen (EMAIL, ETH Zürich); Seyedmorteza Sadat (EMAIL, ETH Zürich)
Pseudocode | Yes | Algorithm 1: Trajectory collection for AGD. Algorithm 2: Adapter training for AGD.
Open Source Code | No | We will publicly release the implementation of our method.
Open Datasets | Yes | ImageNet (Deng et al., 2009). For text-to-image models, we randomly select 500 captions from the COCO-2017 training set (Lin et al., 2014).
Dataset Splits | Yes | For training adapters on DiT, trajectories are sampled with guidance scales ω ∼ Unif([1, 6]), with four trajectories per class label of ImageNet (Deng et al., 2009). For text-to-image models, we randomly select 500 captions from the COCO-2017 training set (Lin et al., 2014), generating a single trajectory per caption with guidance scales ω ∼ Unif([1, 12]). ... The FID scores for class-conditional models were computed using 10k generated samples and the entire ImageNet training set. For text-to-image models, we used the full COCO-2017 validation set as the real data.
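The trajectory-collection recipe quoted above (for the DiT case: ω ∼ Unif([1, 6]), four trajectories per ImageNet class) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name, seed handling, and return format are assumptions.

```python
import random

def sample_guidance_scales(num_classes=1000, trajectories_per_class=4,
                           lo=1.0, hi=6.0, seed=0):
    """Sketch of the guidance-scale sampling described for DiT:
    omega ~ Unif([lo, hi]), `trajectories_per_class` trajectories per
    ImageNet class label. Returns (class_label, omega) pairs that would
    then drive trajectory collection with the teacher model.
    """
    rng = random.Random(seed)  # seeding is an assumption for reproducibility
    return [(c, rng.uniform(lo, hi))
            for c in range(num_classes)
            for _ in range(trajectories_per_class)]

pairs = sample_guidance_scales()
# 1000 classes x 4 trajectories = 4000 (class, omega) pairs
```

For the text-to-image models, the same idea applies with 500 COCO captions in place of class labels, one trajectory per caption, and `hi=12.0`.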
Hardware Specification | Yes | All experiments are conducted on a single RTX 4090 GPU (24 GB of VRAM).
Software Dependencies | No | The paper mentions the Adam optimizer and other techniques but does not specify versions for software libraries such as PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | Training is performed using the Adam optimizer (Kingma & Ba, 2014) without weight decay, where the learning rate follows a linear warm-up to 1 × 10−4 over the first 10% of steps, after which it decays via a cosine annealing schedule (Loshchilov & Hutter, 2016). For training adapters on DiT, trajectories are sampled with guidance scales ω ∼ Unif([1, 6]), with four trajectories per class label of ImageNet (Deng et al., 2009). ... The DiT-XL/2 model was trained with a batch size of 64 for 5000 gradient steps, the SD2.1 model with a batch size of 8 for 5000 gradient steps, and the SDXL model with a batch size of 1 for 20000 gradient steps. These settings were selected based on the maximum batch size that fits within 24 GB of VRAM.
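The learning-rate schedule quoted above (linear warm-up to 1e-4 over the first 10% of steps, then cosine annealing) can be sketched per step like this. The function name is illustrative, and decaying all the way to zero is an assumption; the report does not state the final learning rate.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.1):
    """Per-step learning rate: linear warm-up to peak_lr over the first
    warmup_frac of training, then cosine annealing over the remainder.
    (Annealing to zero is an assumption, not stated in the report.)
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear warm-up: 0 -> peak_lr over the first 10% of steps
        return peak_lr * (step + 1) / warmup_steps
    # cosine annealing over the remaining 90% of steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With the DiT-XL/2 setting of 5000 gradient steps, the warm-up covers steps 0–499 and the cosine decay spans the remaining 4500 steps; the same schedule shape would apply to the SD2.1 and SDXL runs with their respective step counts.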