Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Authors: Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J Zico Kolter
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, our method shows significantly better generalizability and transferability as we achieve the best performance in almost all metrics when experimenting with closed-source models in comparison to baselines including Textual Inversion (Gal et al., 2023), PEZ (Wen et al., 2023), BLIP2 (Li et al., 2023) and CLIP-Interrogator. Our results also indicate that PRISM consistently outperforms existing methods with respect to human-interpretability while maintaining high visual accuracy. Finally, we demonstrate that the strong human interpretability makes the prompts generated by PRISM easily editable, unlocking a wide array of creative possibilities in real life. |
| Researcher Affiliation | Collaboration | Yutong He1, Alexander Robey1,2, Naoki Murata3, Yiding Jiang1, Joshua N. Williams1, George J. Pappas2, Hamed Hassani2, Yuki Mitsufuji3,4, Ruslan Salakhutdinov1, J. Zico Kolter1,5 — Carnegie Mellon University1, University of Pennsylvania2, Sony AI3, Sony Group Corporation4, Bosch Center for AI5 |
| Pseudocode | Yes | The pseudocode and an illustration are outlined in Algorithm 1 and Figure 2 respectively. |
| Open Source Code | Yes | Our code is available via the project page here. |
| Open Datasets | Yes | We use the DreamBooth dataset (Ruiz et al., 2023) to quantitatively compare the performance in personalized T2I generation. We also qualitatively demonstrate the ability to represent a certain artistic style using the WikiArt dataset (Tan et al., 2019). We use images from the DiffusionDB dataset (Wang et al., 2022) for the direct image inversion task. |
| Dataset Splits | No | The paper describes how images are generated for evaluation and comparison against baselines using reference images from datasets such as DreamBooth, WikiArt, and DiffusionDB. It details the number of images generated per subject/template combination, but it does not specify traditional training/validation/test splits for these datasets, either as input to PRISM or for evaluating its prompt generation in a split-based manner. |
| Hardware Specification | Yes | Table 8: Latency comparison between our method and the baselines on the task of DreamBooth personalization on SDXL-Turbo. All PRISM variations have budget N × K = 40. ... on a single NVIDIA A6000 GPU. |
| Software Dependencies | No | The paper mentions specific models used, such as GPT-4V, SDXL-Turbo, Mistral 7B, CLIP-ViT L-14, DINO-V2-Base, and BLIP2-Flan-T5-XL, often with citations to the papers introducing them. However, it does not provide specific version numbers for ancillary software components like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries used in the implementation. |
| Experiment Setup | Yes | For personalized T2I generation, we use a maximum budget of 40 and report the quantitative results with N = 10, K = 4. For direct image inversion, we use a maximum budget of 30 and report the quantitative results with N = 6, K = 5. For SD 2.1 and SDXL-Turbo, we clip all prompt lengths to 77 due to their context length constraint. During PRISM iterations, we allow a maximum of 5 generation attempts for each stream and each iteration in case of potential run time errors related to black-box API calls. We set the maximum number of tokens generated by the prompt engineer assistant at each iteration to be 500. To simplify the implementation, we only keep a chat history length of 3 and use the length of the prompt as an approximation of the log-likelihood for the final prompt selection. |
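The hyperparameters quoted above can be collected into a single configuration for reproduction attempts. The sketch below is illustrative only: the class and field names (`PrismConfig`, `n_streams`, etc.) are my own and do not come from the paper's released code; only the numeric values are taken from the experiment-setup quote.

```python
from dataclasses import dataclass


@dataclass
class PrismConfig:
    """Hypothetical container for the PRISM settings reported in the paper."""
    n_streams: int                      # N: number of parallel prompt streams
    k_iterations: int                   # K: refinement iterations per stream
    max_prompt_tokens: int = 77         # clip length for SD 2.1 / SDXL-Turbo context limit
    max_generation_attempts: int = 5    # retries per stream/iteration for black-box API errors
    max_assistant_tokens: int = 500     # cap on prompt-engineer output per iteration
    chat_history_length: int = 3        # chat turns kept to simplify implementation

    @property
    def budget(self) -> int:
        # Total generation budget is the product N * K.
        return self.n_streams * self.k_iterations


# Settings reported in the paper:
personalized_t2i = PrismConfig(n_streams=10, k_iterations=4)  # budget 40
direct_inversion = PrismConfig(n_streams=6, k_iterations=5)   # budget 30
```

Under this reading, the two reported budgets (40 and 30) fall out of the N and K choices directly, which matches the "maximum budget" figures in the quote.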