Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Authors: Zhen Liu, Tim Xiao, Weiyang Liu, Yoshua Bengio, Dinghuai Zhang

ICLR 2025

Reproducibility Assessment (each entry lists the variable, the result, and the supporting LLM response)
Research Type: Experimental
  "In Figure 6 and Table 1, we show the evolution of reward, DreamSim diversity and FID scores of all methods with the mean curves and the corresponding standard deviations (on 3 random seeds). Our proposed residual ∇-DB is able to achieve comparable convergence speed, measured in update steps, to that of the gradient-free baselines..."
Researcher Affiliation: Collaboration
  1. Mila, Université de Montréal; 2. Max Planck Institute for Intelligent Systems, Tübingen; 3. The Chinese University of Hong Kong (Shenzhen); 4. University of Tübingen; 5. University of Cambridge; 6. Microsoft Research
Pseudocode: Yes
  Appendix A (Overall Algorithm): "Algorithm 1: ∇-GFlowNet Diffusion Finetuning with residual ∇-DB"
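For orientation, the detailed-balance (DB) constraint that the algorithm builds on can be sketched in a few lines. Note this is the *standard* squared log-ratio DB objective for a single transition, not the paper's residual gradient-informed ∇-DB variant, and all names below are illustrative:

```python
def detailed_balance_loss(log_f_s, log_pf, log_f_next, log_pb):
    # Squared residual of the detailed balance constraint in log space:
    #   F(s) * P_F(s'|s) = F(s') * P_B(s|s')
    # where F is the state flow, P_F the forward (denoising) policy and
    # P_B the backward (noising) policy. The loss is zero exactly when
    # the constraint holds for this transition.
    return (log_f_s + log_pf - log_f_next - log_pb) ** 2
```

In GFlowNet finetuning of a diffusion model, each denoising step s_t → s_{t-1} contributes one such transition term.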
Open Source Code: Yes
  Project page: nabla-gfn.github.io
Open Datasets: Yes
  "For the main experiments, we consider three reward functions: Aesthetic Score [28], Human Preference Score (HPSv2) [69, 70] and Image Reward [71], all of which are trained on large-scale human preference datasets such as LAION-aesthetic [28] and predict the logarithm of reward values."
Dataset Splits: No
  No specific dataset splits (training, validation, test) are explicitly provided as percentages or sample counts. The paper describes data collection and usage during training, such as collecting "64 generation trajectories" and sub-sampling "10% of the transitions" for loss computation, but not a defined split of an overall dataset for reproducibility.
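The collect-and-subsample step described above can be sketched in pure Python. The trajectory count and the 10% fraction come from the quoted text; the 50-step horizon is an assumption matching the DDPM sampler mentioned elsewhere in this table, and the function name is illustrative:

```python
import random

def collect_and_subsample(num_trajectories=64, horizon=50, frac=0.10, seed=0):
    # Each T-step diffusion trajectory yields T transitions (s_t -> s_{t-1});
    # here a transition is represented only by its (trajectory, step) tag.
    transitions = [(i, t)
                   for i in range(num_trajectories)
                   for t in range(horizon)]
    # Sub-sample a fixed fraction of all transitions for loss computation.
    rng = random.Random(seed)
    k = int(len(transitions) * frac)
    return rng.sample(transitions, k)
```

With 64 trajectories of 50 steps, this selects 320 of the 3200 transitions per epoch.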
Hardware Specification: Yes
  "All methods are benchmarked on a single node with 8 80GB-mem A100 GPUs."
Software Dependencies: No
  No specific software dependencies with version numbers are provided. The paper mentions a "50-step DDPM sampler [17]", "Stable Diffusion-v1.5 [51]" as the base model, and "LoRA [21]" for finetuning, but these are models and techniques, not software dependencies with pinned versions of languages or libraries.
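As a library-independent illustration of what a DDPM-style sampler involves, here is a minimal linear noise schedule. The constants are the original DDPM paper's defaults, not values confirmed by this paper, and in practice a 50-step sampler subsamples the base model's much longer training schedule:

```python
def ddpm_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Linearly spaced per-step noise variances (betas), as in DDPM.
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    # alpha_t = 1 - beta_t; alpha_bar_t is the running product, which
    # controls how much of the clean signal survives at step t.
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    prod = 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars
```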
Experiment Setup: Yes
  "For all experiments with residual ∇-DB, we set the learning rate to 1e-3 and ablate over a set of choices of reward temperature β... We set the output regularization strength λ = 2000 in Aesthetic Score experiments and λ = 5000 in HPSv2 and Image Reward experiments... For HPSv2 and Image Reward experiments, we set β to be 500000 and 10000, respectively. For each epoch, we collect 64 generation trajectories... We set the number of gradient accumulation steps to 4..."
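The quoted hyperparameters can be gathered into one configuration sketch. The dictionary keys are illustrative names chosen here, not identifiers from the paper's code, and the small helper only demonstrates how gradient accumulation reduces the number of optimizer updates:

```python
# Values quoted from the experiment-setup entry above.
CONFIG = {
    "learning_rate": 1e-3,
    "lambda_output_reg": {"aesthetic": 2000, "hpsv2": 5000, "image_reward": 5000},
    "reward_temperature_beta": {"hpsv2": 500_000, "image_reward": 10_000},
    "trajectories_per_epoch": 64,
    "grad_accumulation_steps": 4,
}

def num_optimizer_updates(num_minibatches,
                          accum=CONFIG["grad_accumulation_steps"]):
    # With gradient accumulation, gradients from `accum` consecutive
    # minibatches are summed before a single optimizer step, so the
    # effective batch size grows by a factor of `accum`.
    return num_minibatches // accum
```

For example, 32 minibatches with 4 accumulation steps yield 8 optimizer updates.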