One Step Diffusion via Shortcut Models

Authors: Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical evaluations show that shortcut models satisfy a number of useful desiderata. On the commonly used CelebA-HQ and ImageNet-256 benchmarks, a single shortcut model can handle many-step, few-step, and one-step generation. Accuracy is not sacrificed; in fact, many-step generation quality matches that of baseline diffusion models. At the same time, shortcut models can consistently match or outperform two-stage distillation methods in the few- and one-step settings.
Researcher Affiliation Academia Kevin Frans (UC Berkeley, EMAIL); Danijar Hafner (UC Berkeley); Sergey Levine (UC Berkeley); Pieter Abbeel (UC Berkeley)
Pseudocode Yes Algorithm 1 (Shortcut Model Training); Algorithm 2 (Sampling)
Open Source Code Yes We release model checkpoints and the full training code for replicating our experimental results: https://github.com/kvfrans/shortcut-models
Open Datasets Yes On the commonly used CelebA-HQ and ImageNet-256 benchmarks, a single shortcut model can handle many-step, few-step, and one-step generation.
Dataset Splits Yes We report the FID-50k metric, as is standard in prior work. Following standard practice, FID is calculated with respect to statistics over the entire dataset; no compression is applied to the generated images, and images are resized to 299x299 with bilinear upscaling and clipped to (-1, 1).
Hardware Specification Yes All experiments are run on TPUv3 nodes, and methods are implemented in JAX.
Software Dependencies No The paper mentions that methods are implemented in JAX, but does not provide a specific version number for JAX or any other software libraries used.
Experiment Setup Yes Table 3: Hyperparameters used during training. Model architecture follows that described in Peebles & Xie (2023), specifically DiT-B unless mentioned otherwise. Batch Size: 64 (CelebA-HQ), 256 (ImageNet); Training Steps: 400,000 (CelebA-HQ), 800,000 (ImageNet); Latent Encoder: sd-vae-mse-ft; Latent Downsampling: 8x (256x256x3 to 32x32x4); Ratio of Empirical to Bootstrap Targets: 0.75; Number of Total Denoising Steps (M): 128; Classifier-Free Guidance: 0 (CelebA-HQ), 1.5 (ImageNet); EMA Parameters Used for Bootstrap Targets: Yes; EMA Parameters Used for Evaluation: Yes; EMA Ratio: 0.999; Optimizer: AdamW; Learning Rate: 0.0001; Weight Decay: 0.1; Hidden Size: 768; Patch Size: 2; Number of Layers: 12; Attention Heads: 12; MLP Hidden Size Ratio: 4
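The sampling procedure referenced above (Algorithm 2) is what lets a single shortcut model do many-step, few-step, and one-step generation: the model is conditioned on the step size d, and sampling simply takes num_steps jumps of size d = 1/num_steps. The sketch below is a minimal NumPy stand-in for that loop, not the paper's JAX implementation; the `model(x, t, d)` signature and the `toy` velocity field are illustrative assumptions.

```python
import numpy as np

def sample(model, x, num_steps):
    """Denoise x (noise at t=0) toward data at t=1 in num_steps uniform jumps.

    model(x, t, d) is assumed to return the predicted shortcut
    (average velocity) for jumping from time t by step size d.
    """
    d = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        x = x + d * model(x, t, d)
        t += d
    return x

# Toy "model": a constant velocity field, for which any step count is exact.
toy = lambda x, t, d: np.ones_like(x)
x0 = np.zeros(4)
one_step = sample(toy, x0, 1)     # one-step generation
many_step = sample(toy, x0, 128)  # many-step generation (M = 128 in the paper)
```

With a trained shortcut model the different step counts trade speed for quality; the toy constant field merely shows that the same loop serves every budget.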
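The "Ratio of Empirical to Bootstrap Targets: 0.75" row corresponds to how each training batch is split (Algorithm 1): most samples regress onto the ordinary flow-matching target at d = 0, while the rest get a self-consistency (bootstrap) target in which one shortcut of size 2d must match two consecutive shortcuts of size d computed with EMA parameters. Below is a hedged NumPy sketch of that target construction; the function name, the `model_ema(x, t, d)` signature, and the toy constant-velocity EMA model are assumptions for illustration.

```python
import numpy as np

def shortcut_targets(model_ema, x0, x1, rng, empirical_frac=0.75, M=128):
    """Build one batch of shortcut-model regression targets (sketch)."""
    n = x0.shape[0]
    n_emp = int(empirical_frac * n)          # 0.75 of the batch is empirical
    t = rng.uniform(size=(n, 1))
    xt = (1 - t) * x0 + t * x1               # linear interpolation path

    # Empirical flow-matching targets at step size d = 0.
    d = np.zeros((n, 1))
    target = x1 - x0

    # Bootstrap targets: two EMA-model steps of size db are averaged
    # into one target for step size 2*db.
    db = rng.choice([2.0 ** -k for k in range(1, int(np.log2(M)) + 1)],
                    size=(n - n_emp, 1))
    s1 = model_ema(xt[n_emp:], t[n_emp:], db)
    x_mid = xt[n_emp:] + db * s1
    s2 = model_ema(x_mid, t[n_emp:] + db, db)
    d[n_emp:] = 2 * db
    target[n_emp:] = (s1 + s2) / 2
    return xt, t, d, target

# Toy EMA "model" with a constant velocity field (illustration only).
toy_ema = lambda x, t, d: np.ones_like(x)
rng = np.random.default_rng(0)
xt, t, d, target = shortcut_targets(
    toy_ema, np.zeros((4, 2)), np.ones((4, 2)), rng)
```

For a constant velocity field both target types coincide, which is the consistency property the bootstrap loss enforces on the learned model.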