Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Authors: Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J Zico Kolter
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, our method shows significantly better generalizability and transferability as we achieve the best performance in almost all metrics when experimenting with closed-source models in comparison to baselines including Textual Inversion (Gal et al., 2023), PEZ (Wen et al., 2023), BLIP2 (Li et al., 2023) and CLIP-Interrogator. Our results also indicate that PRISM consistently outperforms existing methods with respect to human-interpretability while maintaining high visual accuracy. Finally, we demonstrate that the strong human interpretability makes the prompts generated by PRISM easily editable, unlocking a wide array of creative possibilities in real life. |
| Researcher Affiliation | Collaboration | Yutong He1, Alexander Robey1,2, Naoki Murata3, Yiding Jiang1, Joshua N. Williams1, George J. Pappas2, Hamed Hassani2, Yuki Mitsufuji3,4, Ruslan Salakhutdinov1, J. Zico Kolter1,5 — Carnegie Mellon University1, University of Pennsylvania2, Sony AI3, Sony Group Corporation4, Bosch Center for AI5 |
| Pseudocode | Yes | The pseudocode and an illustration are outlined in Algorithm 1 and Figure 2 respectively. |
| Open Source Code | Yes | Our code is available via the project page here. |
| Open Datasets | Yes | We use the DreamBooth dataset (Ruiz et al., 2023) to quantitatively compare the performance in personalized T2I generation. We also qualitatively demonstrate the ability to represent a certain artistic style using the WikiArt dataset (Tan et al., 2019). We use images from the DiffusionDB dataset (Wang et al., 2022) for the direct image inversion task. |
| Dataset Splits | No | The paper describes how images are generated for evaluation and comparison against baselines using reference images from datasets such as DreamBooth, WikiArt, and DiffusionDB. It details the number of images generated per subject/template combination, but it does not specify traditional training/validation/test splits for these datasets, either as input to PRISM or for evaluating its prompt generation in a split-based manner. |
| Hardware Specification | Yes | Table 8: Latency comparison between our method and the baselines on the task of DreamBooth personalization on SDXL-Turbo. All PRISM variations have budget N × K = 40. ... on a single NVIDIA A6000 GPU. |
| Software Dependencies | No | The paper mentions specific models used, such as GPT-4V, SDXL-Turbo, Mistral 7B, CLIP-ViT L-14, DINO-V2-Base, and BLIP2-Flan-T5-XL, often with citations to the papers introducing them. However, it does not provide specific version numbers for ancillary software components like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries used in the implementation. |
| Experiment Setup | Yes | For personalized T2I generation, we use a maximum budget of 40 and report the quantitative results with N = 10, K = 4. For direct image inversion, we use a maximum budget of 30 and report the quantitative results with N = 6, K = 5. For SD 2.1 and SDXL-Turbo, we clip all prompt lengths to 77 due to their context length constraint. During PRISM iterations, we allow a maximum of 5 generation attempts for each stream and each iteration in case of potential run time errors related to black-box API calls. We set the maximum number of tokens generated by the prompt engineer assistant at each iteration to be 500. To simplify the implementation, we only keep a chat history length of 3 and use the length of the prompt as an approximation of the log-likelihood for the final prompt selection. |
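The hyperparameters quoted above can be collected into a single configuration for reproduction attempts. The sketch below is illustrative only: the class and field names (`PrismConfig`, `n_streams`, etc.) are my own and do not come from the paper's released code; only the numeric values are taken from the experiment-setup quote.

```python
from dataclasses import dataclass


@dataclass
class PrismConfig:
    """Hypothetical container for the PRISM settings reported in the paper."""
    n_streams: int                      # N: number of parallel prompt streams
    k_iterations: int                   # K: refinement iterations per stream
    max_prompt_tokens: int = 77         # clip length for SD 2.1 / SDXL-Turbo context limit
    max_generation_attempts: int = 5    # retries per stream/iteration for black-box API errors
    max_assistant_tokens: int = 500     # cap on prompt-engineer output per iteration
    chat_history_length: int = 3        # chat turns kept to simplify implementation

    @property
    def budget(self) -> int:
        # Total generation budget is the product N * K.
        return self.n_streams * self.k_iterations


# Settings reported in the paper:
personalized_t2i = PrismConfig(n_streams=10, k_iterations=4)  # budget 40
direct_inversion = PrismConfig(n_streams=6, k_iterations=5)   # budget 30
```

Under this reading, the two reported budgets (40 and 30) fall out of the N and K choices directly, which matches the "maximum budget" figures in the quote.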