Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets
Authors: Zhen Liu, Tim Xiao, Weiyang Liu, Yoshua Bengio, Dinghuai Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 6 and Table 1, we show the evolution of reward, DreamSim diversity and FID scores of all methods with the mean curves and the corresponding standard deviations (on 3 random seeds). Our proposed residual ∇-DB is able to achieve comparable convergence speed, measured in update steps, to that of the gradient-free baselines... |
| Researcher Affiliation | Collaboration | 1Mila, Université de Montréal 2Max Planck Institute for Intelligent Systems Tübingen 3The Chinese University of Hong Kong (Shenzhen) 4University of Tübingen 5University of Cambridge 6Microsoft Research |
| Pseudocode | Yes | Appendix A (Overall Algorithm): Algorithm 1, ∇-GFlowNet Diffusion Finetuning with residual ∇-DB |
| Open Source Code | Yes | Project page: nabla-gfn.github.io |
| Open Datasets | Yes | For the main experiments, we consider three reward functions: Aesthetic Score [28], Human Preference Score (HPSv2) [69, 70] and Image Reward [71], all of which are trained on large-scale human preference datasets such as LAION-aesthetic [28] and predict the logarithm of reward values. |
| Dataset Splits | No | No specific dataset splits (training, validation, test) are explicitly provided in terms of percentages or sample counts. The paper describes aspects of data collection and usage during training, such as collecting "64 generation trajectories" and "sub-sample 10% of the transitions" for loss computation, but not a defined split of an overall dataset for reproducibility. |
| Hardware Specification | Yes | All methods are benchmarked on a single node with 8 80GB-mem A100 GPUs. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions using "50-step DDPM sampler [17]" and "Stable Diffusion-v1.5 [51]" as the base model, and "LoRA [21]" for finetuning, but these are models/techniques, not general software dependencies with version numbers like programming languages or libraries. |
| Experiment Setup | Yes | For all experiments with residual ∇-DB, we set the learning rate to 1e-3 and ablate over a set of choices of reward temperature β... We set the output regularization strength λ = 2000 in Aesthetic Score experiments and λ = 5000 in HPSv2 and Image Reward experiments... For HPSv2 and Image Reward experiments, we set β to be 500000 and 10000, respectively. For each epoch, we collect 64 generation trajectories... We set the number of gradient accumulation steps to 4... |
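The hyperparameters quoted in the table above can be collected into a single configuration sketch. The snippet below is a hypothetical illustration (the `Config` class and `subsample_transitions` helper are not from the authors' code) of how the reported values combine: 64 trajectories per epoch from a 50-step DDPM sampler yield 3200 transitions, of which 10% are sub-sampled for the loss.

```python
import random

class Config:
    """Hypothetical container for the hyperparameters reported in the paper."""
    learning_rate = 1e-3            # residual ∇-DB learning rate
    lambda_reg = {"aesthetic": 2000, "hpsv2": 5000, "image_reward": 5000}
    beta = {"hpsv2": 500_000, "image_reward": 10_000}  # reward temperature β
    trajectories_per_epoch = 64     # generation trajectories collected per epoch
    transition_subsample = 0.10     # fraction of transitions used for the loss
    grad_accum_steps = 4
    ddpm_steps = 50                 # 50-step DDPM sampler

def subsample_transitions(trajectories, frac, rng=None):
    """Flatten (trajectory, step) index pairs and keep a random `frac` of them."""
    rng = rng or random.Random(0)
    transitions = [(t, s) for t in range(len(trajectories))
                   for s in range(len(trajectories[t]))]
    k = max(1, int(len(transitions) * frac))
    return rng.sample(transitions, k)

# 64 trajectories x 50 denoising steps = 3200 transitions; keep 10% = 320.
trajs = [[None] * Config.ddpm_steps for _ in range(Config.trajectories_per_epoch)]
batch = subsample_transitions(trajs, Config.transition_subsample)
print(len(batch))  # 320
```

This matches the paper's description of collecting full generation trajectories each epoch but computing the loss on only a 10% subset of transitions, which keeps the per-update cost low while still covering the trajectory distribution over epochs.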