DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Authors: Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present DREAMBENCH++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DREAMBENCH++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings. Table 1: Evaluation of personalized image generation models on DREAMBENCH++. Table 3: Ablation study of prompt design. Table 4: The human alignment degree among different evaluation metrics, measured by Pearson correlation value. |
| Researcher Affiliation | Collaboration | Yuang Peng1,4, Yuxin Cui1, Haomiao Tang1, Zekun Qi1, Runpei Dong2, Jing Bai3, Chunrui Han4, Zheng Ge4, Xiangyu Zhang4, Shu-Tao Xia1; 1Tsinghua University, 2UIUC, 3UCAS, 4Step Fun |
| Pseudocode | No | The paper describes methods and processes through figures and descriptive text (e.g., Figure 3 illustrates the overall procedure of prompting GPT-4o, and Figure 4 describes the dataset construction process). However, it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | We are presenting DREAMBENCH++ with open-sourced codes and evaluation standardization to promote innovation within the research community. |
| Open Datasets | No | DREAMBENCH++ primarily sources images from Unsplash (uns), Rawpixel (raw), and Google Image Search (goo), along with authorized individual contributions. After image collection, 9 text prompts per image were generated using GPT-4o... As a result, the construction process finally yielded 150 high-quality images and 1,350 prompts. |
| Dataset Splits | No | The paper describes the construction of a dataset for benchmarking purposes, comprising 150 images and 1,350 prompts. However, as this dataset is intended for evaluation rather than training models from scratch, the concept of explicit training, validation, or test splits is not applicable or provided. The entire dataset is used for evaluation, as implied by statements like 'We employ 7 human annotators to score each instance in DREAMBENCH++'. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, memory specifications) for running its experiments. It mentions implementing methods and tuning parameters but lacks concrete hardware details. |
| Software Dependencies | No | The paper mentions using GPT-4o and base T2I models like SD v1.5 and SDXL v1.0, but it does not provide specific version numbers for these or any other software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | Table 8: Training hyperparameters on DreamBench and DREAMBENCH++. BS: batch size, LR: learning rate, Steps: training steps. During the inference stage, all methods use a guidance_scale of 7.5 and 100 inference steps, with the exception of Emu2, which uses a guidance_scale of 3 and 50 inference steps. Furthermore, BLIP-Diffusion and IP-Adapter incorporate negative prompts, as demonstrated in Table 9. IP-Adapter additionally sets an ip_adapter_scale of 0.6. |
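The inference settings quoted above (a shared default of guidance_scale 7.5 with 100 steps, Emu2's reduced values, and IP-Adapter's extra scale parameter) can be captured in a small configuration table. The sketch below is illustrative only; the dictionary layout and helper function are assumptions, not code from the paper's release, though the numeric values come from the reported setup.

```python
# Per-model inference settings as reported in the paper's experiment setup.
# The structure here is a hypothetical convenience, not the authors' code.
INFERENCE_CONFIG = {
    "default": {"guidance_scale": 7.5, "num_inference_steps": 100},
    "Emu2": {"guidance_scale": 3.0, "num_inference_steps": 50},
    "IP-Adapter": {
        "guidance_scale": 7.5,
        "num_inference_steps": 100,
        "ip_adapter_scale": 0.6,  # extra parameter noted for IP-Adapter
    },
}

def settings_for(model_name: str) -> dict:
    """Return the inference kwargs for a model, falling back to the default."""
    return INFERENCE_CONFIG.get(model_name, INFERENCE_CONFIG["default"])
```

In a diffusers-style pipeline, `guidance_scale` and `num_inference_steps` would typically be passed directly to the pipeline call, while the IP-Adapter scale is set separately on the pipeline object.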