Improving GFlowNets for Text-to-Image Diffusion Alignment
Authors: Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Joshua M. Susskind, Navdeep Jaitly, Shuangfei Zhai
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method can effectively align large-scale text-to-image diffusion models with given reward information. Our code is publicly available at https://github.com/apple/ml-diffusion-alignment-gflownet. Experimental setups: We choose Stable Diffusion v1.5 (Rombach et al., 2021) as our base generative model. For training, we use low-rank adaptation (Hu et al., 2021, LoRA) for parameter-efficient computation. As for the reward functions, we experiment with the LAION Aesthetics predictor, a neural aesthetic scorer trained from human feedback that gives an input image an aesthetic rating. For text-image alignment rewards, we choose ImageReward (Xu et al., 2023) and the human preference score (HPSv2) (Wu et al., 2023). Both are CLIP-style (Radford et al., 2021) models that take a text-image pair as input and output a scalar score indicating to what extent the image follows the text description. We also test the (in)compressibility reward, which computes the file size of the input image when written to disk. |
| Researcher Affiliation | Industry | Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Josh Susskind, Navdeep Jaitly, Shuangfei Zhai (Apple). Work done during internship at Apple MLR. Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1: Diffusion alignment with GFlowNets (DAG-DB & DAG-KL). Require: denoising policy p_θ(x_{t-1}\|x_t, t), noising policy q(x_t\|x_{t-1}), flow function F_ϕ(x_t, t), black-box reward function R(·). 1: repeat 2: roll out τ = {x_t}_t with p_θ(x_{t-1}\|x_t, t) 3: for each transition (x_t, x_{t-1}) ∈ τ: 4: if algorithm is DAG-DB then 5: # standard DB-based update 6: update θ and ϕ with Equation 8 7: else if algorithm is DAG-KL then 8: # KL-based update 9: update ϕ with Equation 8 10: update θ with Equation 14 11: end if 12: until some convergence condition |
| Open Source Code | Yes | Our code is publicly available at https://github.com/apple/ml-diffusion-alignment-gflownet. |
| Open Datasets | Yes | Experimental setups: We choose Stable Diffusion v1.5 (Rombach et al., 2021) as our base generative model. For training, we use low-rank adaptation (Hu et al., 2021, LoRA) for parameter-efficient computation. As for the reward functions, we experiment with the LAION Aesthetics predictor, a neural aesthetic scorer trained from human feedback that gives an input image an aesthetic rating. For text-image alignment rewards, we choose ImageReward (Xu et al., 2023) and the human preference score (HPSv2) (Wu et al., 2023). Both are CLIP-style (Radford et al., 2021) models that take a text-image pair as input and output a scalar score indicating to what extent the image follows the text description. We also test the (in)compressibility reward, which computes the file size of the input image when written to disk. As for the prompt distribution, we use a set of 45 simple animal prompts from Black et al. (2023) for the Aesthetics task; all ImageNet class names for the (in)compressibility task; the DrawBench (Saharia et al., 2022) prompt set for the ImageReward task; and the photo and painting prompts from the human preference dataset (HPDv2) (Wu et al., 2023) for the HPSv2 task. We also include a toy experiment on a CIFAR-10 pretrained DDPM. |
| Dataset Splits | No | The paper describes the prompt sets used (e.g., 45 simple animal prompts for Aesthetics, ImageNet class names for compressibility, DrawBench for ImageReward, HPDv2 for HPSv2) and mentions using a CIFAR-10 pretrained model. However, it does not explicitly provide training/validation/test splits (e.g., percentages or sample counts) for the datasets or prompt sets used in its own experiments. |
| Hardware Specification | Yes | Regarding training hyperparameters, we follow the DDPO GitHub repository implementation and describe them below for completeness. We use classifier-free guidance (Ho & Salimans, 2022, CFG) with a guidance weight of 5. We use a 50-step DDIM schedule. We use 8 NVIDIA A100 80GB GPUs for each task, with a batch size of 8 per GPU. |
| Software Dependencies | No | The paper mentions using bfloat16 precision and the Hugging Face diffusers package. However, it does not provide specific version numbers for these software components or for the programming language used. |
| Experiment Setup | Yes | Regarding training hyperparameters, we follow the DDPO GitHub repository implementation and describe them below for completeness. We use classifier-free guidance (Ho & Salimans, 2022, CFG) with a guidance weight of 5. We use a 50-step DDIM schedule. We use 8 NVIDIA A100 80GB GPUs for each task, with a batch size of 8 per GPU. We do 4-step gradient accumulation, which makes the effective batch size 256. For each epoch, we sample 512 trajectories during the rollout phase and perform 8 optimization steps during the training phase. We train for 100 epochs. We use a 3 × 10^-4 learning rate for both the diffusion model and the flow function model without further tuning. We use the AdamW optimizer and gradient clipping with norm 1. We set ϵ = 1 × 10^-4 in Equation 14. We use bfloat16 precision. The GFlowNet framework requires the reward function to be non-negative, so we take the exponential of the reward as the GFlowNet reward. We set the reward exponent to β = 100 (i.e., setting the distribution temperature to 1/100); therefore, log R(·) = β · R_original(·). Note that in GFlowNet training practice, we only need the logarithm of the reward rather than the original reward value. We linearly anneal β from 0 to its maximal value over the first half of training; we found that this barely changes the final result but helps training stability. For DAG-KL, we put the final β coefficient on the KL gradient term. We also find a KL regularization D_KL(p_θ(x_{t-1}\|x_t) ∥ p_θ_old(x_{t-1}\|x_t)) helpful for stability (this is also mentioned in Fan et al. (2023)). In practice, this amounts to adding an ℓ2 regularization term on the output of the U-Net after CFG between the current model and the previous rollout model. We use a coefficient of 1 on this term without further tuning. |
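The reward-temperature recipe quoted in the Experiment Setup row — use log R(·) = β · R_original(·) and linearly anneal β from 0 to 100 over the first half of training — can be sketched in a few lines. The function names and the 100-epoch default are illustrative stand-ins, not taken from the paper's code.

```python
def beta_schedule(epoch: int, total_epochs: int = 100, beta_max: float = 100.0) -> float:
    """Linearly anneal the reward exponent beta from 0 to beta_max over
    the first half of training, then hold it constant (per the paper)."""
    half = total_epochs / 2
    return beta_max * min(epoch / half, 1.0)


def log_gflownet_reward(raw_reward: float, beta: float) -> float:
    # GFlowNet rewards must be non-negative, so the paper uses
    # R = exp(beta * R_original); training only ever needs log R,
    # which is just the scaled raw reward.
    return beta * raw_reward
```

For example, at epoch 25 of 100 the exponent is halfway to its maximum (β = 50), and from epoch 50 onward it stays at β = 100.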
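The rollout-then-per-transition-update structure of Algorithm 1 (DAG-DB branch) can be illustrated with a toy, self-contained loop. Everything here is a hypothetical stand-in: scalar-parameter "networks", Gaussian toy log-densities, and a dummy reward, chosen only so the loop runs end to end. It shows the detailed-balance pattern — match log F(x_t) + log p_θ(x_{t-1}|x_t) against log F(x_{t-1}) + log q(x_t|x_{t-1}), anchoring log F(x_0) to the log-reward — but does not reproduce the paper's Equation 8; the θ update and the DAG-KL branch are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5        # toy number of denoising steps (the paper uses a 50-step DDIM schedule)
DIM = 4      # toy state dimension

theta = rng.normal(size=DIM)   # stand-in for denoising policy parameters
phi = np.zeros(T + 1)          # stand-in log-flow estimates, phi[t] ~ log F(x_t, t)
lr = 1e-2

def log_p_forward(x_next, x, t):
    # Toy Gaussian log-density for p_theta(x_{t-1} | x_t, t) (up to a constant).
    return -0.5 * np.sum((x_next - x - theta) ** 2)

def log_q_backward(x, x_next, t):
    # Toy fixed noising log-density for q(x_t | x_{t-1}) (up to a constant).
    return -0.5 * np.sum((x - x_next) ** 2)

def log_reward(x0):
    # Stand-in for beta * R_original(x_0).
    return -np.sum(x0 ** 2)

for step in range(10):
    # 1. Rollout: sample a trajectory tau = (x_T, ..., x_0) with the current policy.
    traj = [rng.normal(size=DIM)]                     # x_T from the prior
    for t in range(T, 0, -1):
        traj.append(traj[-1] + theta + 0.1 * rng.normal(size=DIM))

    # 2. DB-style update on each transition (x_t -> x_{t-1}); only the flow
    #    parameters phi are updated in this sketch.
    for i, t in enumerate(range(T, 0, -1)):
        x, x_next = traj[i], traj[i + 1]
        lhs = phi[t] + log_p_forward(x_next, x, t)
        # The terminal flow is anchored to the log-reward of x_0.
        target_flow = log_reward(x_next) if t == 1 else phi[t - 1]
        rhs = target_flow + log_q_backward(x, x_next, t)
        delta = lhs - rhs                 # residual of the balance condition
        phi[t] -= lr * delta              # gradient step on 0.5 * delta**2
        if t > 1:
            phi[t - 1] += lr * delta
```

In the real algorithm both θ and ϕ receive gradients of this residual loss (DAG-DB), while DAG-KL instead updates θ with a KL-based gradient (the paper's Equation 14).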