Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Authors: Weijian Luo

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In the experiment sections, we align both UNet-based and DiT-based one-step generators using DI++, which uses Stable Diffusion 1.5 and PixArt-α as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixArt-α. Both the theoretical contributions and the empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models.
Researcher Affiliation Academia Weijian Luo EMAIL Peking University
Pseudocode Yes Algorithm 1: Diff-Instruct++ for aligning the generator model with a human-feedback reward. Input: prompt dataset C, generator gθ(x0|z, c), prior distribution pz, reward model r(x, c), reward scale αrew, CFG scale αcfg, reference diffusion model sref(xt|t, c), TA diffusion sψ(xt|t, c), forward diffusion p(xt|x0) (Eq. 2.1), TA diffusion update rounds KTA, time distribution π(t), diffusion model weighting λ(t), generator IKL loss weighting w(t).
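The inputs listed above can be connected in a minimal, runnable sketch of the alternating training loop. Everything here is an illustrative assumption rather than the authors' implementation: toy linear networks stand in for the generator and score networks, prompt conditioning is dropped, the CFG scale is folded into a single scalar on the reference score, and the λ(t)/w(t) weightings are omitted.

```python
import torch
import torch.nn as nn

def forward_diffuse(x0, t):
    """Toy forward diffusion p(x_t | x_0) = N(x_0, t^2 I)."""
    eps = torch.randn_like(x0)
    return x0 + t.unsqueeze(-1) * eps, eps

def di_pp_round(g, s_psi, s_ref, reward, opt_g, opt_psi,
                batch=8, dim=4, alpha_rew=1.0, alpha_cfg=4.5, K_TA=2):
    """One DI++ round: K_TA TA-diffusion updates, then one generator update."""
    # --- K_TA rounds: fit the TA diffusion to the current generator output ---
    for _ in range(K_TA):
        with torch.no_grad():
            x0 = g(torch.randn(batch, dim))          # one-step generator samples
        t = torch.rand(batch) + 0.01                 # draw t ~ pi(t)
        xt, eps = forward_diffuse(x0, t)
        loss_psi = ((s_psi(xt) - eps) ** 2).mean()   # denoising score matching
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()

    # --- generator update: IKL score-difference gradient plus reward term ---
    x0 = g(torch.randn(batch, dim))
    t = torch.rand(batch) + 0.01
    xt, _ = forward_diffuse(x0, t)
    with torch.no_grad():
        # reference score (CFG folded into one scalar here) minus TA score
        grad = alpha_cfg * s_ref(xt) - s_psi(xt)
    loss_g = (grad * xt).sum() / batch - alpha_rew * reward(x0).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return float(loss_g)
```

With toy `nn.Linear` modules standing in for the three networks and any differentiable `reward(x)`, a single call to `di_pp_round` performs the KTA TA updates followed by one generator step.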
Open Source Code Yes The homepage of the paper is: https://github.com/pkulwj1994/diff_instruct_pp.
Open Datasets Yes The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixArt-α. For the SD1.5 experiment, we use prompts from the Laion-Aesthetic dataset with an Aesthetic score higher than 6.25 (i.e., the Laion-Aesthetic 625+ dataset), following SiD-LSG. For the PixArt-α experiment, we use the prompts from the SAM-LLaVA-Caption-10M dataset as our prompt dataset. The SAM-LLaVA-Caption-10M dataset contains the images collected by Kirillov et al. (2023), together with text descriptions captioned by the LLaVA model (Liu et al., 2024a).
Dataset Splits Yes For the other four scores, we pick 1k text prompts from the MSCOCO-2017 validation dataset and evaluate all models on these prompts.
Hardware Specification Yes We pre-train the one-step model on 4 Nvidia A100 GPUs for two days (4 × 48 = 192 GPU hours), with a batch size of 1024. Number of GPUs: 4 A100-40G for pre-training and 4 H800-80G for alignment.
Software Dependencies No We train the model with the PyTorch framework. All experiments in this section were conducted with bfloat16 precision, using the PixArt-XL-2-512x512 model version and the same hyperparameters. For both optimizers, we use Adam with a learning rate of 5e-6 and betas = [0, 0.999].
Experiment Setup Yes We set the Adam optimizer's beta parameters to β1 = 0.0 and β2 = 0.999 for both the pre-training and alignment stages. We use a learning rate of 5e-6 for both the TA diffusion and the student one-step generator. For the one-step generator model, we use the adaptive exponential moving average technique, following the implementation of EDM (Karras et al., 2022). We pre-train the one-step model on 4 Nvidia A100 GPUs for two days (4 × 48 = 192 GPU hours), with a batch size of 1024. For the alignment stage, we use a fixed exponential moving average (EMA) decay rate of 0.95 for all training trials. ... More specifically, we align the generator model under five configurations with different CFG and reward scales: 1. no CFG, no reward: CFG scale 1.0, reward scale 0.0; 2. no CFG, weak reward: CFG scale 1.0, reward scale 1.0; 3. strong CFG, no reward: CFG scale 4.5, reward scale 0.0; 4. strong CFG, weak reward: CFG scale 4.5, reward scale 1.0; 5. strong CFG, strong reward: CFG scale 4.5, reward scale 10.0.
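The optimizer and EMA settings quoted above can be sketched directly in PyTorch. This is a hypothetical stand-in configuration, not the authors' code: `generator` and `ta_diffusion` are toy modules, and only the fixed-decay EMA used in the alignment stage (decay 0.95) is shown, not EDM's adaptive variant.

```python
import copy
import torch

# Toy stand-in modules for the one-step generator and the TA diffusion.
generator = torch.nn.Linear(8, 8)
ta_diffusion = torch.nn.Linear(8, 8)

# Adam with beta1 = 0.0, beta2 = 0.999 and lr 5e-6 for both networks,
# matching the hyperparameters quoted in the setup.
opt_g = torch.optim.Adam(generator.parameters(), lr=5e-6, betas=(0.0, 0.999))
opt_ta = torch.optim.Adam(ta_diffusion.parameters(), lr=5e-6, betas=(0.0, 0.999))

# Fixed EMA decay of 0.95 for the alignment stage.
ema_generator = copy.deepcopy(generator)
EMA_DECAY = 0.95

@torch.no_grad()
def ema_update(ema_model, model, decay=EMA_DECAY):
    """ema <- decay * ema + (1 - decay) * online weights, in place."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```

Calling `ema_update(ema_generator, generator)` after each optimizer step keeps a smoothed copy of the generator weights for evaluation.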