Test-time Alignment of Diffusion Models without Reward Over-optimization
Authors: Sunwoo Kim, Minkyu Kim, Dongmin Park
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. ... We empirically validate DAS's effectiveness across diverse scenarios, including single-reward, multi-objective, and online black-box optimization tasks. |
| Researcher Affiliation | Collaboration | Sunwoo Kim (Seoul National University); Minkyu Kim, Dongmin Park (KRAFTON) |
| Pseudocode | Yes | The pseudo-code of the final algorithm with adaptive resampling is given in Algorithm A.1. ... Detailed pseudocode for our full DAS algorithm is included in Appendix A, with versions covering adaptive resampling (Algorithm 1), adaptive tempering (Algorithm 3), and adaptation to the online setting (Algorithm 5). |
| Open Source Code | Yes | Code is available at https://github.com/krafton-ai/DAS. |
| Open Datasets | Yes | For single reward tasks, we use aesthetic scores (Schuhmann et al., 2022) and human preference evaluated by PickScore (Kirstain et al., 2023) as objectives. For fine-tuning methods, we used animals from ImageNet (Deng et al., 2009) and prompts from Human Preference Dataset v2 (HPDv2) (Wu et al., 2023b) when training on aesthetic score and PickScore respectively, like previous settings (Black et al., 2023; Clark et al., 2024). |
| Dataset Splits | No | Evaluation uses unseen prompts from the same dataset. ... We used HPDv2 prompts for training and evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, or cloud instance specifications) for running its experiments. |
| Software Dependencies | No | We used the official PyTorch codebases of DDPO, AlignProp, and TDPO with minimal change of hyperparameters from the settings in the original papers and codebases. We adapted the official PyTorch codebases of FreeDoM and MPGD to work with the diffusers library. |
| Experiment Setup | Yes | For fine-tuning methods, we used 200 epochs and an effective batch size of 256, using gradient accumulation if needed, for all methods. ... Across all experiment results except ablation studies, we used 100 diffusion time steps with γ = 0.008 for tempering. ... For single reward experiments, we used KL coefficient α = 0.01 for the aesthetic score task and α = 0.0001 for the PickScore task, considering the scale of the rewards. For multi-objective experiments and online black-box optimization, we used α = 0.005. We used 16 particles if not explicitly mentioned. ... For pre-training via conditional score matching, we used learning rate 0.001 with 1000 epochs. For DDS, we used learning rate 3e-5 with 300 epochs. We used the Adam optimizer for all training or fine-tuning. |
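The pseudocode row above notes that DAS's Algorithm 1 uses adaptive resampling over a set of particles (16 by default, per the experiment setup). A minimal sketch of the standard effective-sample-size (ESS) criterion behind such adaptive resampling is shown below; the function names, the 0.5 ESS threshold, and multinomial resampling are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

NUM_PARTICLES = 16  # default particle count quoted in the experiment setup


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to N."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)


def adaptive_resample(particles, weights, threshold=0.5):
    """Resample only when the ESS drops below threshold * N.

    When triggered, particles are drawn multinomially in proportion to
    their weights and the weights are reset to uniform; otherwise the
    particle set is kept and the weights are merely normalized.
    """
    n = len(weights)
    if effective_sample_size(weights) < threshold * n:
        probs = weights / weights.sum()
        idx = np.random.choice(n, size=n, p=probs)
        return particles[idx], np.full(n, 1.0 / n)
    return particles, weights / weights.sum()
```

With uniform weights the ESS equals the particle count, so no resampling occurs; once a few particles dominate the weight mass, the ESS collapses and the resampling branch fires. Applying this check at each of the 100 diffusion steps is what makes the resampling "adaptive" rather than unconditional.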