Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Authors: Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models. In our experiments, we show that VGD qualitatively and quantitatively achieves state-of-the-art (SOTA) performance, demonstrating superior interpretability, generalizability, and flexibility in text-to-image generation compared to previous soft and hard prompt inversion methods. We also show that VGD is compatible with a combination of various LLMs (i.e., LLaMA2, LLaMA3, Mistral) and image generation models (i.e., DALL-E 2, Midjourney, Stable Diffusion 2). |
| Researcher Affiliation | Academia | Donghoon Kim1, Minji Bae1, Kyuhong Shim2, Byonghyo Shim1; 1Seoul National University, 2Sungkyunkwan University. EMAIL; EMAIL |
| Pseudocode | No | The paper describes the methodology, including problem formulation, approximation with CLIP score, and token-by-token generation, using descriptive text and mathematical equations. It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present the steps in a structured, code-like format. |
| Open Source Code | No | The paper does not contain an unambiguous statement from the authors that they are releasing their code for the methodology described. It references a third-party tool's GitHub link ('1https://github.com/pharmapsychotic/clip-interrogator'), but this is not for their own implementation. |
| Open Datasets | Yes | Datasets We conduct experiments on four datasets with diverse distributions: LAION400M (Schuhmann et al., 2021; 2022), MS COCO (Lin et al., 2014), Celeb-A (Liu et al., 2015), and Lexica.art (https://huggingface.co/datasets/Gustavosta/Stable-Diffusion-Prompts). Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details. |
| Dataset Splits | Yes | Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details. |
| Hardware Specification | Yes | We further investigate the efficiency of VGD in comparison with other baseline methods, measured on a single A100 80GB GPU. |
| Software Dependencies | Yes | For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for the text-to-image model. Images are generated with the Stable Diffusion 2.1-768 model across all comparisons (Podell et al., 2024). |
| Experiment Setup | Yes | The beam width K is set to 10. The balancing hyperparameter α is set to 0.67. For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for the text-to-image model. |
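The setup rows above describe a token-by-token generation scheme: a beam search of width K = 10 in which candidate tokens are ranked by a combination of the language model's probability and a CLIP image-text score, balanced by α = 0.67. Since the paper releases no code, the sketch below is only a minimal, hypothetical illustration of that decoding loop; `lm_logprob` and `clip_score` are stand-in callables for the real LLM and CLIP models, and the exact way VGD combines the two terms may differ.

```python
def beam_search_decode(lm_logprob, clip_score, vocab, K=10, alpha=0.67, max_len=16):
    """Gradient-free, token-by-token beam decoding (illustrative sketch).

    At each step, every beam prefix is extended by every vocabulary token
    and scored by a weighted sum of the LM log-probability and a CLIP-style
    image-text score; the top-K candidates survive to the next step.

    lm_logprob(prefix, tok) -> float : stand-in for the LLM's log p(tok | prefix)
    clip_score(tokens)      -> float : stand-in for CLIP similarity of the
                                       candidate prompt to the target image
    """
    beams = [([], 0.0)]  # (token sequence, cumulative score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in vocab:
                step = (1 - alpha) * lm_logprob(prefix, tok) \
                       + alpha * clip_score(prefix + [tok])
                candidates.append((prefix + [tok], score + step))
        # Keep only the K highest-scoring candidate prompts.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beams[0][0]  # best prompt found


# Toy usage with dummy scorers (no real models involved):
vocab = ["a", "b", "c"]
lm = lambda prefix, tok: {"a": -1.0, "b": -2.0, "c": -3.0}[tok]
clip = lambda seq: 1.0 if seq[-1] == "b" else 0.0
print(beam_search_decode(lm, clip, vocab, K=2, alpha=0.67, max_len=3))
```

With these toy scorers, the CLIP term (weight α = 0.67) outweighs the LM preference for "a", so the decoder selects "b" at every step; lowering α would flip that trade-off toward fluency over image alignment.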