Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Authors: Weijian Luo

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In the experiment sections, we align both UNet-based and DiT-based one-step generators using DI++, which uses Stable Diffusion 1.5 and PixArt-α as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixArt-α. Both the theoretical contributions and the empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models.
Researcher Affiliation Academia Weijian Luo EMAIL Peking University
Pseudocode Yes Algorithm 1: Diff-Instruct++ for aligning the generator model with a human-feedback reward. Input: prompt dataset C, generator gθ(x0|z, c), prior distribution pz, reward model r(x, c), reward scale αrew, CFG scale αcfg, reference diffusion model sref(xt|t, c), TA diffusion sψ(xt|t, c), forward diffusion p(xt|x0) (Eq. 2.1), TA diffusion update rounds KTA, time distribution π(t), diffusion model weighting λ(t), generator IKL loss weighting w(t).
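The inputs listed above can be connected in a minimal, runnable sketch of the alternating training loop. Everything here is an illustrative assumption rather than the authors' implementation: toy linear networks stand in for the generator and score networks, prompt conditioning is dropped, the CFG scale is folded into a single scalar on the reference score, and the λ(t)/w(t) weightings are omitted.

```python
import torch
import torch.nn as nn

def forward_diffuse(x0, t):
    """Toy forward diffusion p(x_t | x_0) = N(x_0, t^2 I)."""
    eps = torch.randn_like(x0)
    return x0 + t.unsqueeze(-1) * eps, eps

def di_pp_round(g, s_psi, s_ref, reward, opt_g, opt_psi,
                batch=8, dim=4, alpha_rew=1.0, alpha_cfg=4.5, K_TA=2):
    """One DI++ round: K_TA TA-diffusion updates, then one generator update."""
    # --- K_TA rounds: fit the TA diffusion to the current generator output ---
    for _ in range(K_TA):
        with torch.no_grad():
            x0 = g(torch.randn(batch, dim))          # one-step generator samples
        t = torch.rand(batch) + 0.01                 # draw t ~ pi(t)
        xt, eps = forward_diffuse(x0, t)
        loss_psi = ((s_psi(xt) - eps) ** 2).mean()   # denoising score matching
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()

    # --- generator update: IKL score-difference gradient plus reward term ---
    x0 = g(torch.randn(batch, dim))
    t = torch.rand(batch) + 0.01
    xt, _ = forward_diffuse(x0, t)
    with torch.no_grad():
        # reference score (CFG folded into one scalar here) minus TA score
        grad = alpha_cfg * s_ref(xt) - s_psi(xt)
    loss_g = (grad * xt).sum() / batch - alpha_rew * reward(x0).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return float(loss_g)
```

With toy `nn.Linear` modules standing in for the three networks and any differentiable `reward(x)`, a single call to `di_pp_round` performs the KTA TA updates followed by one generator step.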
Open Source Code Yes The homepage of the paper is: https://github.com/pkulwj1994/diff_instruct_pp.
Open Datasets Yes The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixArt-α. For the SD1.5 experiment, we use prompts from the Laion-Aesthetic dataset with an Aesthetic score higher than 6.25 (i.e., the Laion-Aesthetic 625+ dataset), following SiD-LSG. For the PixArt-α experiment, we use the prompts from the SAM-LLaVA-Caption-10M dataset as our prompt dataset. The SAM-LLaVA-Caption-10M dataset contains the images collected by Kirillov et al. (2023), together with text descriptions captioned by the LLaVA model (Liu et al., 2024a).
Dataset Splits Yes For the other four scores, we pick 1k text prompts from the MSCOCO-2017 validation dataset and evaluate all models on these prompts.
Hardware Specification Yes We pre-train the one-step model on 4 Nvidia A100 GPUs for two days (4 × 48 = 192 GPU hours), with a batch size of 1024. Number of GPUs: 4 A100-40G for pre-training and 4 H800-80G for alignment.
Software Dependencies No We train the model with the PyTorch framework. All experiments in this section were conducted with bfloat16 precision, using the PixArt-XL-2-512x512 model version and the same hyperparameters. For both optimizers, we use Adam with a learning rate of 5e-6 and betas = [0, 0.999].
Experiment Setup Yes We set the Adam optimizer's beta parameters to β1 = 0.0 and β2 = 0.999 for both the pre-training and alignment stages. We use a learning rate of 5e-6 for both the TA diffusion and the student one-step generator. For the one-step generator model, we use the adaptive exponential moving average technique, following the implementation of EDM (Karras et al., 2022). We pre-train the one-step model on 4 Nvidia A100 GPUs for two days (4 × 48 = 192 GPU hours), with a batch size of 1024. For the alignment stage, we use a fixed exponential moving average (EMA) decay rate of 0.95 for all training trials. ... More specifically, we align the generator model under five configurations with different CFG and reward scales: 1. no CFG, no reward: CFG scale 1.0, reward scale 0.0; 2. no CFG, weak reward: CFG scale 1.0, reward scale 1.0; 3. strong CFG, no reward: CFG scale 4.5, reward scale 0.0; 4. strong CFG, weak reward: CFG scale 4.5, reward scale 1.0; 5. strong CFG, strong reward: CFG scale 4.5, reward scale 10.0.
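The optimizer and EMA settings quoted above can be sketched directly in PyTorch. This is a hypothetical stand-in configuration, not the authors' code: `generator` and `ta_diffusion` are toy modules, and only the fixed-decay EMA used in the alignment stage (decay 0.95) is shown, not EDM's adaptive variant.

```python
import copy
import torch

# Toy stand-in modules for the one-step generator and the TA diffusion.
generator = torch.nn.Linear(8, 8)
ta_diffusion = torch.nn.Linear(8, 8)

# Adam with beta1 = 0.0, beta2 = 0.999 and lr 5e-6 for both networks,
# matching the hyperparameters quoted in the setup.
opt_g = torch.optim.Adam(generator.parameters(), lr=5e-6, betas=(0.0, 0.999))
opt_ta = torch.optim.Adam(ta_diffusion.parameters(), lr=5e-6, betas=(0.0, 0.999))

# Fixed EMA decay of 0.95 for the alignment stage.
ema_generator = copy.deepcopy(generator)
EMA_DECAY = 0.95

@torch.no_grad()
def ema_update(ema_model, model, decay=EMA_DECAY):
    """ema <- decay * ema + (1 - decay) * online weights, in place."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```

Calling `ema_update(ema_generator, generator)` after each optimizer step keeps a smoothed copy of the generator weights for evaluation.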