Test-time Alignment of Diffusion Models without Reward Over-optimization

Authors: Sunwoo Kim, Minkyu Kim, Dongmin Park

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities." ... "We empirically validate DAS's effectiveness across diverse scenarios, including single-reward, multi-objective, and online black-box optimization tasks."
Researcher Affiliation | Collaboration | Sunwoo Kim (1), Minkyu Kim (2), Dongmin Park (2); (1) Seoul National University, (2) KRAFTON
Pseudocode | Yes | "The pseudo-code of the final algorithm with adaptive resampling is given in Algorithm A.1." ... "Detailed pseudocode for our full DAS algorithm is included in Appendix A, with versions with adaptive resampling (Algorithm 1), adaptive tempering (Algorithm 3), and adaptation to the online setting (Algorithm 5)."
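The paper's Algorithm 1 itself is not reproduced here, but adaptive resampling in this context generally follows the standard Sequential Monte Carlo recipe: resample particles only when the effective sample size (ESS) of the importance weights drops below a threshold. The sketch below illustrates that generic recipe only; it is not DAS's actual implementation, and the function names and threshold are illustrative assumptions.

```python
import numpy as np

def ess(log_weights):
    """Effective sample size of a set of (unnormalized) log importance weights."""
    w = np.exp(log_weights - log_weights.max())  # stabilize before exponentiating
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def adaptive_resample(particles, log_weights, threshold=0.5):
    """Multinomial resampling, triggered only when ESS < threshold * N.

    Generic SMC sketch (not the paper's Algorithm 1): resampling resets the
    log-weights to uniform (zeros); otherwise particles pass through unchanged.
    """
    n = len(particles)
    if ess(log_weights) < threshold * n:
        w = np.exp(log_weights - log_weights.max())
        w /= w.sum()
        idx = np.random.choice(n, size=n, p=w)  # draw ancestors by weight
        return particles[idx], np.zeros(n)
    return particles, log_weights
```

With uniform weights the ESS equals the particle count and no resampling occurs; a single dominant weight collapses the ESS toward 1 and triggers a resample.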
Open Source Code | Yes | "Code is available at https://github.com/krafton-ai/DAS."
Open Datasets | Yes | "For single reward tasks, we use aesthetic scores (Schuhmann et al., 2022) and human preference evaluated by PickScore (Kirstain et al., 2023) as objectives. For fine-tuning methods, we used animals from ImageNet (Deng et al., 2009) and prompts from Human Preference Dataset v2 (HPDv2) (Wu et al., 2023b) when training on aesthetic score and PickScore respectively, like previous settings (Black et al., 2023; Clark et al., 2024)."
Dataset Splits | No | "Evaluation uses unseen prompts from the same dataset." ... "We used HPDv2 prompts for training and evaluation."
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | "We used the official PyTorch codebases of DDPO, AlignProp, and TDPO with minimal changes of hyperparameters from the settings in the original papers and codebases. We adapted the official PyTorch codebases of FreeDoM and MPGD to work with the diffusers library."
Experiment Setup | Yes | "For fine-tuning methods, we used 200 epochs and an effective batch size of 256 using gradient accumulation if needed for all methods." ... "Across all experiment results except ablation studies, we used 100 diffusion time steps with γ = 0.008 for tempering." ... "For single-reward experiments, we used KL coefficient α = 0.01 for the aesthetic score task and α = 0.0001 for the PickScore task, considering the scale of the rewards. For multi-objective experiments and online black-box optimization, we used α = 0.005. We used 16 particles if not explicitly mentioned." ... "For pre-training via conditional score matching, we used learning rate 0.001 with 1000 epochs. For DDS, we used learning rate 3e-5 with 300 epochs. We used the Adam optimizer for all training or fine-tuning."
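The hyperparameters quoted above can be collected into a single configuration sketch for anyone attempting a reproduction. The dictionary keys below are illustrative names chosen for this summary, not identifiers from the official DAS codebase; only the values come from the paper's reported setup.

```python
# Hypothetical reproduction config; key names are illustrative, values are
# the ones reported in the paper's experiment setup.
das_config = {
    "num_diffusion_steps": 100,        # all experiments except ablations
    "tempering_gamma": 0.008,
    "num_particles": 16,               # default unless stated otherwise
    "kl_alpha": {                      # KL coefficient α per task
        "aesthetic_score": 0.01,
        "pick_score": 0.0001,
        "multi_objective": 0.005,
        "online_black_box": 0.005,
    },
    "fine_tuning": {"epochs": 200, "effective_batch_size": 256},
    "pretraining_score_matching": {"lr": 1e-3, "epochs": 1000},
    "dds": {"lr": 3e-5, "epochs": 300},
    "optimizer": "Adam",               # used for all training / fine-tuning
}
```

Grouping the per-task α values makes the reward-scale dependence explicit: the aesthetic-score task uses a coefficient two orders of magnitude larger than the PickScore task.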