DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Authors: Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present DREAMBENCH++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DREAMBENCH++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings. Table 1: Evaluation of personalized image generation models on DREAMBENCH++. Table 3: Ablation study of prompt design. Table 4: The human alignment degree among different evaluation metrics, measured by Pearson correlation value. |
| Researcher Affiliation | Collaboration | Yuang Peng1,4, Yuxin Cui1, Haomiao Tang1, Zekun Qi1, Runpei Dong2, Jing Bai3, Chunrui Han4, Zheng Ge4, Xiangyu Zhang4, Shu-Tao Xia1; 1Tsinghua University, 2UIUC, 3UCAS, 4Step Fun |
| Pseudocode | No | The paper describes methods and processes through figures and descriptive text (e.g., Figure 3 illustrates the overall procedure of prompting GPT-4o, and Figure 4 describes the dataset construction process). However, it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | We are presenting DREAMBENCH++ with open-sourced codes and evaluation standardization to promote innovation within the research community. |
| Open Datasets | No | DREAMBENCH++ primarily sources images from Unsplash (uns), Rawpixel (raw), and Google Image Search (goo), along with authorized individual contributions. After image collection, 9 text prompts per image were generated using GPT-4o... As a result, the construction process finally yielded 150 high-quality images and 1,350 prompts. |
| Dataset Splits | No | The paper describes the construction of a dataset for benchmarking purposes, comprising 150 images and 1,350 prompts. However, as this dataset is intended for evaluation rather than training models from scratch, the concept of explicit training, validation, or test splits is not applicable or provided. The entire dataset is used for evaluation, as implied by statements like 'We employ 7 human annotators to score each instance in DREAMBENCH++'. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, memory specifications) for running its experiments. It mentions implementing methods and tuning parameters but lacks concrete hardware details. |
| Software Dependencies | No | The paper mentions using GPT-4o and base T2I models like SD v1.5 and SDXL v1.0, but it does not provide specific version numbers for these or any other software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | Table 8: Training hyperparameters on DreamBench and DREAMBENCH++. BS: batch size, LR: learning rate, Steps: training steps. During the inference stage, all methods use a guidance_scale of 7.5 and 100 inference steps, with the exception of Emu2, which uses a guidance_scale of 3 and 50 inference steps. Furthermore, BLIP-Diffusion and IP-Adapter incorporate negative prompts, as demonstrated in Table 9. IP-Adapter additionally sets an ip_adapter_scale of 0.6. |
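The inference settings quoted above (a shared default of guidance_scale 7.5 with 100 steps, Emu2's reduced values, and IP-Adapter's extra scale parameter) can be captured in a small configuration table. The sketch below is illustrative only; the dictionary layout and helper function are assumptions, not code from the paper's release, though the numeric values come from the reported setup.

```python
# Per-model inference settings as reported in the paper's experiment setup.
# The structure here is a hypothetical convenience, not the authors' code.
INFERENCE_CONFIG = {
    "default": {"guidance_scale": 7.5, "num_inference_steps": 100},
    "Emu2": {"guidance_scale": 3.0, "num_inference_steps": 50},
    "IP-Adapter": {
        "guidance_scale": 7.5,
        "num_inference_steps": 100,
        "ip_adapter_scale": 0.6,  # extra parameter noted for IP-Adapter
    },
}

def settings_for(model_name: str) -> dict:
    """Return the inference kwargs for a model, falling back to the default."""
    return INFERENCE_CONFIG.get(model_name, INFERENCE_CONFIG["default"])
```

In a diffusers-style pipeline, `guidance_scale` and `num_inference_steps` would typically be passed directly to the pipeline call, while the IP-Adapter scale is set separately on the pipeline object.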