One Step Diffusion via Shortcut Models
Authors: Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations show that shortcut models satisfy a number of useful desiderata. On the commonly used CelebA-HQ and ImageNet-256 benchmarks, a single shortcut model can handle many-step, few-step, and one-step generation. Accuracy is not sacrificed; in fact, many-step generation quality matches that of baseline diffusion models. At the same time, shortcut models can consistently match or outperform two-stage distillation methods in the few- and one-step settings. |
| Researcher Affiliation | Academia | Kevin Frans (UC Berkeley), Danijar Hafner (UC Berkeley), Sergey Levine (UC Berkeley), Pieter Abbeel (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1: Shortcut Model Training; Algorithm 2: Sampling |
| Open Source Code | Yes | We release model checkpoints and the full training code for replicating our experimental results: https://github.com/kvfrans/shortcut-models |
| Open Datasets | Yes | On the commonly used CelebA-HQ and ImageNet-256 benchmarks, a single shortcut model can handle many-step, few-step, and one-step generation. |
| Dataset Splits | Yes | We report the FID-50k metric, as is standard in prior work. Following standard practice, FID is calculated with respect to statistics over the entire dataset; no compression is applied to the generated images, and images are resized to 299x299 with bilinear upscaling and clipped to (-1, 1). |
| Hardware Specification | Yes | All experiments are run on TPUv3 nodes, and methods are implemented in JAX. |
| Software Dependencies | No | The paper mentions that methods are implemented in JAX, but does not provide a specific version number for JAX or any other software libraries used. |
| Experiment Setup | Yes | Table 3: Hyperparameters used during training. Model architecture follows that described in Peebles & Xie (2023), specifically DiT-B unless mentioned otherwise. Batch Size: 64 (CelebA-HQ), 256 (ImageNet); Training Steps: 400,000 (CelebA-HQ), 800,000 (ImageNet); Latent Encoder: sd-vae-ft-mse; Latent Downsampling: 8 (256x256x3 to 32x32x4); Ratio of Empirical to Bootstrap Targets: 0.75; Number of Total Denoising Steps (M): 128; Classifier-Free Guidance: 0 (CelebA-HQ), 1.5 (ImageNet); EMA Parameters Used for Bootstrap Targets: Yes; EMA Parameters Used for Evaluation: Yes; EMA Ratio: 0.999; Optimizer: AdamW; Learning Rate: 0.0001; Weight Decay: 0.1; Hidden Size: 768; Patch Size: 2; Number of Layers: 12; Attention Heads: 12; MLP Hidden Size Ratio: 4 |
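The hyperparameters quoted in the Experiment Setup row can be collected into a single training config, which makes the reported setup easier to verify at a glance. The sketch below is illustrative, not the authors' released code (their implementation is in JAX; see the linked repository); all key names are assumptions. It also derives one property implied by the table: with M = 128 total denoising steps, a shortcut model's admissible step budgets are the power-of-two divisors of M, spanning 128-step down to one-step generation.

```python
# Hypothetical config sketch assembling the hyperparameters from Table 3.
# Key names are illustrative assumptions, not the repository's actual schema.
TOTAL_DENOISING_STEPS = 128  # M in the table

config = {
    "batch_size": {"celeba_hq": 64, "imagenet": 256},
    "training_steps": {"celeba_hq": 400_000, "imagenet": 800_000},
    "latent_encoder": "sd-vae-ft-mse",
    "latent_downsampling": 8,          # 256x256x3 images -> 32x32x4 latents
    "empirical_target_ratio": 0.75,    # remainder of batch uses bootstrap targets
    "cfg_scale": {"celeba_hq": 0.0, "imagenet": 1.5},
    "ema_ratio": 0.999,                # EMA params used for targets and eval
    "optimizer": {"name": "adamw", "lr": 1e-4, "weight_decay": 0.1},
    # DiT-B architecture (Peebles & Xie, 2023)
    "hidden_size": 768,
    "patch_size": 2,
    "num_layers": 12,
    "num_heads": 12,
    "mlp_ratio": 4,
}

# Sampling budgets a single trained model supports: power-of-two
# divisors of M, from full 128-step generation down to one step.
step_budgets = [TOTAL_DENOISING_STEPS // (2 ** k)
                for k in range(TOTAL_DENOISING_STEPS.bit_length())]
print(step_budgets)  # [128, 64, 32, 16, 8, 4, 2, 1]
```

Keeping the per-dataset values (batch size, training steps, CFG scale) in small sub-dicts mirrors how the paper reports them side by side and avoids duplicating the shared architecture settings.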