Efficient Distillation of Classifier-Free Guidance using Adapters
Authors: Cristian Perez Jensen, Seyedmorteza Sadat
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models (≈ 2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method. Setup: We evaluate AGD on class-conditional generation using 256 × 256 DiT-XL/2 (Peebles & Xie, 2023), and text-to-image generation using 768 × 768 Stable Diffusion 2.1 (SD2.1) (Rombach et al., 2022) and 1024 × 1024 Stable Diffusion XL (SDXL) (Podell et al., 2024). |
| Researcher Affiliation | Academia | Cristian Perez Jensen EMAIL ETH Zürich Seyedmorteza Sadat EMAIL ETH Zürich |
| Pseudocode | Yes | Algorithm 1 Trajectory collection for AGD. Algorithm 2 Adapter training for AGD. |
| Open Source Code | No | We will publicly release the implementation of our method. |
| Open Datasets | Yes | ImageNet (Deng et al., 2009). For text-to-image models, we randomly select 500 captions from the COCO-2017 training set (Lin et al., 2014) |
| Dataset Splits | Yes | For training adapters on DiT, trajectories are sampled with guidance scales ω ∼ Unif([1, 6]), with four trajectories per class label of ImageNet (Deng et al., 2009). For text-to-image models, we randomly select 500 captions from the COCO-2017 training set (Lin et al., 2014), generating a single trajectory per caption with guidance scales ω ∼ Unif([1, 12]). ... The FID scores for class-conditional models were computed using 10k generated samples and the entire ImageNet training set. For text-to-image models, we used the full COCO-2017 validation set as the real data. |
| Hardware Specification | Yes | All experiments are conducted on a single RTX 4090 GPU (24 GB of VRAM). |
| Software Dependencies | No | The paper mentions the Adam optimizer and other techniques but does not specify software versions for libraries like PyTorch, TensorFlow, CUDA, etc. |
| Experiment Setup | Yes | Training is performed using the Adam optimizer (Kingma & Ba, 2014) without weight decay, where the learning rate follows a linear warm-up to 1 × 10−4 over the first 10% of steps, after which it decays via a cosine annealing schedule (Loshchilov & Hutter, 2016). For training adapters on DiT, trajectories are sampled with guidance scales ω ∼ Unif([1, 6]), with four trajectories per class label of ImageNet (Deng et al., 2009). ... The DiT-XL/2 model was trained with a batch size of 64 for 5000 gradient steps, the SD2.1 model with a batch size of 8 for 5000 gradient steps, and the SDXL model with a batch size of 1 for 20000 gradient steps. These settings were selected based on the maximum batch size that fits within 24 GB of VRAM. |
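The learning-rate schedule quoted in the Experiment Setup row (linear warm-up to 1 × 10−4 over the first 10% of steps, then cosine annealing) can be sketched as a standalone function. This is a minimal, framework-free sketch under the stated hyperparameters; the function name and the assumption that cosine annealing decays to zero are mine, not taken from the paper.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.1):
    """Learning rate at a given gradient step: linear warm-up to peak_lr
    over the first warmup_frac of training, then cosine annealing.

    Assumes the cosine schedule decays from peak_lr toward 0, which the
    paper does not state explicitly (only that it uses cosine annealing).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up: ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example with the DiT-XL/2 setting of 5000 gradient steps:
# warm-up covers the first 500 steps, peaking at 1e-4.
```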