HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, WeiHsiang Liao, Shao-Hua Sun, Yuki Mitsufuji
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that HERO is 4× more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback. The code and project page are available at https://hero-dm.github.io/. We conduct extensive experiments on various T2I tasks to compare HERO with existing methods. The experimental results show that HERO can effectively fine-tune SD to reliably follow given text prompts with 4× less human feedback compared to D3PO (Yang et al., 2024b). |
| Researcher Affiliation | Collaboration | 1Sony AI, 2 Graduate Institute of Communication Engineering, National Taiwan University, 3University of Southern California, 4Stanford University |
| Pseudocode | Yes | E.1 HERO Detailed Algorithm: In this section, we summarize the algorithm of HERO as presented in Algorithm 1. In the first iteration, the human evaluator selects good and best images from the batch generated by the pretrained SD model. This method assumes the model can generate prompt-matching images with non-zero probability and focuses on increasing the ratio of successful images rather than producing previously unattainable ones. Algorithm 1 HERO's Training. Require: pretrained SD weights φ, best image ratio β, feedback budget N_fb. Initialize: learnable weights θ, # of feedback n_fb ← 0, latent distribution π_HERO ← N(z_T; 0, I). 1: while n_fb < N_fb do 2: Sample n_batch noise latents z_T from π_HERO ▷ Feedback-Guided Image Generation 3: Perform the denoising process for each z_T to obtain the trajectory {z_T, z_{T−1}, ..., z_0}. 4: Decode Z_0 with the SD decoder to obtain images X. 5: Query human feedback on X, and save the corresponding Z_T sets and z_T^best. 6: Update θ of E_θ and g_θ by minimizing Eq. (3). ▷ Feedback-Aligned Representation Learning 7: Compute reward R(z_0) according to Eq. (4). 8: Update φ via DDPO by minimizing Eq. (8). 9: Update the latent distribution π_HERO using Eq. (5). 10: n_fb ← n_fb + n_batch. 11: end while |
| Open Source Code | Yes | The code and project page are available at https://hero-dm.github.io/. |
| Open Datasets | Yes | Diffusion-DPO (Wallace et al., 2023) applies DPO (Rafailov et al., 2023) to directly utilize preference data to fine-tune SD, eliminating the need for predefined rewards. Despite their encouraging results, such a method requires a large-scale pre-collected human preference dataset (e.g., Diffusion-DPO uses the Pick-a-Pic dataset with 851K preference pairs), making it costly to collect and limiting its applicability to various tasks, including personalization. |
| Dataset Splits | No | In each epoch of HERO, feedback on 128 images is collected, and the human evaluator provides a total of 1152 feedback over 9 epochs. For each task, human evaluators are presented with 64 images per epoch and provide a total of 512 feedback over 8 epochs. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory) are mentioned in the paper. |
| Software Dependencies | No | The paper mentions optimizers like Adam and models like SD v1.5 and GPT-4, but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | Table 7: HERO training parameters. Embedding Network E_θ(·) and Classifier Head g_θ(·): Learning rate 1e-5; Optimizer Adam (Kingma & Ba, 2015) (β1=0.9, β2=0.999, weight decay 0); Batch size 2048; Triplet margin α=0.5. SD Finetuning: Learning rate 3e-4; Optimizer Adam (Kingma & Ba, 2015) (β1=0.9, β2=0.999, weight decay 1e-4); Batch size 2; Gradient accumulation steps 4; DDPO clipping parameter 1e-4; Update steps for loss computation K=5. Image Sampling: Diffusion steps 50 (20 for hand); DDIM sampler parameter η=1.0; Classifier-free guidance weight 5.0; Best image ratio β=0.5. |
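The feedback-budget loop of Algorithm 1 can be sketched in a few lines. This is a minimal, hedged illustration only: the diffusion model, the representation-learning update (Eq. 3), the reward (Eq. 4), and the DDPO step (Eq. 8) are all stubbed out, and the feedback oracle `get_feedback` is a hypothetical stand-in for the human evaluator. Only the budget numbers (128 feedback per epoch, 1152 total) and the best-image ratio β = 0.5 come from the paper; the perturbation scale 0.1 is an arbitrary assumption.

```python
import random

# Budget numbers from the paper (hand task): 128 feedback x 9 epochs = 1152.
N_FB = 1152        # total feedback budget
N_BATCH = 128      # images shown to the evaluator per epoch
BETA = 0.5         # best-image ratio beta for the latent distribution pi_HERO

def sample_latents(n, best_latents, beta):
    """Sample initial noise: with probability beta, perturb a previously
    selected 'best' latent; otherwise draw fresh N(0, 1) noise.
    (Loose sketch of the pi_HERO update in Eq. 5; scalars stand in for
    full latent tensors, and the 0.1 perturbation scale is assumed.)"""
    latents = []
    for _ in range(n):
        if best_latents and random.random() < beta:
            base = random.choice(best_latents)
            latents.append(base + random.gauss(0, 0.1))  # perturbed best latent
        else:
            latents.append(random.gauss(0, 1))           # fresh Gaussian noise
    return latents

def hero_loop(get_feedback):
    """Skeleton of Algorithm 1: iterate until the feedback budget is spent."""
    n_fb, best_latents, history = 0, [], []
    while n_fb < N_FB:
        z_T = sample_latents(N_BATCH, best_latents, BETA)
        # (denoise z_T -> images, show to human evaluator -- stubbed here)
        good, best = get_feedback(z_T)
        best_latents = best   # selected best latents feed back into pi_HERO
        # (update E_theta/g_theta via Eq. 3, compute reward Eq. 4,
        #  update SD weights via DDPO Eq. 8 -- all omitted in this sketch)
        n_fb += N_BATCH
        history.append(n_fb)
    return history

# Toy feedback oracle: treat the largest-magnitude latents as "best".
counts = hero_loop(lambda z: (z, sorted(z, key=abs)[-4:]))
print(len(counts), counts[-1])
```

With these budget constants the loop runs for 9 epochs and consumes exactly 1152 feedback, matching the schedule reported in the Dataset Splits row above.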