DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models

Authors: Daewon Chae, June Suk Choi, Jinkyu Kim, Kimin Lee

AAAI 2025

Reproducibility assessment (variable: result — supporting LLM response):
Research Type: Experimental — "We conduct experiments to verify that DiffExp improves both sample efficiency and generated image quality, and demonstrate this across various reward fine-tuning methods such as DDPO or AlignProp. Furthermore, we conduct analysis using more advanced prompt sets such as DrawBench, and apply our method to more advanced diffusion models such as SDXL, both of which result in significant performance improvements."
Researcher Affiliation: Academia — "1Korea University, South Korea; 2KAIST, South Korea; EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode: No — The paper describes its methods using mathematical equations and descriptive text but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No — The paper does not contain any explicit statements or links indicating that the source code for DiffExp will be made publicly available.
Open Datasets: Yes — "First, we utilize an Aesthetic Score (Schuhmann et al. 2022), which is trained to predict the aesthetic quality of images. Following the baseline (Black et al. 2024; Prabhudesai et al. 2023), we use 45 animal names as training prompts for the aesthetic quality task. Second, in order to improve image-text alignment, we employ PickScore (Kirstain et al. 2024), an open-source reward model trained on a large-scale human feedback dataset. Based on the baseline (Black et al. 2024), we use a total of 135 prompts for the image-text alignment task, combining 45 different animal names with 3 different activities (e.g., 'a monkey washing the dishes'). We provide the entire set of prompts used for training in the supplementary materials."
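The 135 alignment prompts are described as the cross product of 45 animal names and 3 activities. A minimal sketch of that construction, using placeholder lists (the animal and activity names below are assumptions; the full lists are only in the paper's supplementary materials):

```python
# Placeholder subset of the 45 animal names used by the paper (assumed values).
animals = ["monkey", "dog", "cat"]
# Placeholder activities; "washing the dishes" is the paper's own example.
activities = ["washing the dishes", "riding a bike", "playing chess"]

# Cross product: one prompt per (animal, activity) pair.
prompts = [f"a {animal} {activity}" for animal in animals for activity in activities]

# With the full lists this yields 45 * 3 == 135 prompts; here, 3 * 3 == 9.
print(len(prompts))  # 9
```

With the paper's actual 45-name and 3-activity lists, the same comprehension reproduces the stated 135-prompt training set.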
Dataset Splits: No — The paper mentions "45 animal names as training prompts", "a total of 135 prompts for the image-text alignment task", a "novel test set of animal names", and "58 challenging prompts from DrawBench". However, it does not describe how these prompts were divided into training/validation/test sets, nor does it give split percentages or sample counts beyond the prompt counts themselves.
Hardware Specification: No — The paper does not contain specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies: No — The paper mentions "Stable Diffusion v1.5" and "Low-Rank Adaptation (LoRA)" but does not provide specific version numbers for these software components or for any other libraries/frameworks.
Experiment Setup: Yes — "As for scheduling exploration, we apply our exploration method only up to three-fourths of the entire fine-tuning. We set w_l to an extremely low value and w_h to an ordinary CFG value (i.e., 5.0 or 7.5). This dynamic scheduling adaptively balances between image quality and diversity, allowing for generating diverse image samples without sacrificing overall sample quality. ... We find that sampling w_prompt randomly from U(1, 1.2) every time is generally successful. ... Further, we experiment with different values of the hyper-parameter t_thres, which determines how long the CFG scale should be maintained at a low value. In Figure 9 (b), we provide reward curves for variants of our models with t_thres = {900, 800, 700}."
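The setup above fixes a low CFG scale early in denoising (while the timestep is above t_thres), only during the first three-fourths of fine-tuning, and samples a prompt weight from U(1, 1.2). A hedged sketch of that schedule, assuming a countdown timestep convention and a placeholder value for the "extremely low" scale (the exact low value and schedule shape are not given in the quote):

```python
import random

W_LOW = 0.5          # assumed "extremely low" CFG scale (paper does not state the value)
W_HIGH = 7.5         # ordinary CFG value from the paper (5.0 or 7.5)
T_THRES = 800        # denoising-step threshold; paper ablates {900, 800, 700}
EXPLORE_FRAC = 0.75  # exploration applied up to three-fourths of fine-tuning

def cfg_scale(denoise_t: int, train_step: int, total_steps: int) -> float:
    """Return the CFG scale for one denoising step.

    denoise_t counts down from ~1000 to 0 during sampling. The scale is kept
    low early in denoising (denoise_t > T_THRES) to encourage exploration,
    but only during the first EXPLORE_FRAC of fine-tuning steps.
    """
    exploring = train_step < EXPLORE_FRAC * total_steps
    if exploring and denoise_t > T_THRES:
        return W_LOW
    return W_HIGH

def sample_w_prompt() -> float:
    """Prompt weight sampled uniformly from U(1, 1.2), as described above."""
    return random.uniform(1.0, 1.2)
```

For example, at fine-tuning step 10 of 100, `cfg_scale(900, 10, 100)` returns the low scale, while `cfg_scale(700, 10, 100)` has already switched back to the ordinary one; past step 75 the ordinary scale is used at every timestep.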