Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Authors: Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models. In our experiments, we show that VGD qualitatively and quantitatively achieves state-of-the-art (SOTA) performance, demonstrating superior interpretability, generalizability, and flexibility in text-to-image generation compared to previous soft and hard prompt inversion methods. We also show that VGD is compatible with combinations of various LLMs (i.e., LLaMA2, LLaMA3, Mistral) and image generation models (i.e., DALL-E 2, Midjourney, Stable Diffusion 2).
Researcher Affiliation | Academia | Donghoon Kim (1), Minji Bae (1), Kyuhong Shim (2), Byonghyo Shim (1); (1) Seoul National University, (2) Sungkyunkwan University; EMAIL; EMAIL
Pseudocode | No | The paper describes the methodology, including problem formulation, approximation with CLIP score, and token-by-token generation, using descriptive text and mathematical equations. It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present the steps in a structured, code-like format.
Open Source Code | No | The paper does not contain an unambiguous statement from the authors that they are releasing their code for the methodology described. It references a third-party tool's GitHub link (https://github.com/pharmapsychotic/clip-interrogator), but this is not for their own implementation.
Open Datasets | Yes | We conduct experiments on four datasets with diverse distributions: LAION400M (Schuhmann et al., 2021; 2022), MS COCO (Lin et al., 2014), Celeb-A (Liu et al., 2015), and Lexica.art (https://huggingface.co/datasets/Gustavosta/Stable-Diffusion-Prompts). Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details.
Dataset Splits | Yes | Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details.
Hardware Specification | Yes | We further investigate the efficiency of VGD in comparison with other baseline methods, measured on a single A100 80GB GPU.
Software Dependencies | Yes | For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for the text-to-image model. Images are generated with the Stable Diffusion 2.1-768 model across all comparisons (Podell et al., 2024).
Experiment Setup | Yes | The beam width K is set to 10. The balancing hyperparameter α is set to 0.67. For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for the text-to-image model.
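The experiment setup above names the two decoding hyperparameters (beam width K = 10, balancing weight α = 0.67) for the token-by-token, CLIP-guided generation the paper describes in text. The sketch below is a minimal toy illustration of that idea only, not the authors' implementation: `lm_log_prob`, `clip_score`, the tiny vocabulary, and the exact way the two scores are combined are all assumptions made for the demo; in VGD the likelihood would come from an LLM (e.g., LLaMA2) and the similarity from a real CLIP model.

```python
# Toy vocabulary and stand-in scoring functions. These are assumptions
# for illustration, NOT the paper's models or formulas.
VOCAB = ["a", "painting", "of", "cat", "dog"]

def lm_log_prob(prefix, token):
    # Hypothetical LM score: mildly prefers short, early-vocab tokens.
    return -0.1 * len(token) - 0.01 * VOCAB.index(token)

def clip_score(image, prompt_tokens):
    # Hypothetical CLIP image-text similarity: rewards prompts that
    # mention "cat" when the target image is a cat photo.
    return 1.0 if "cat" in prompt_tokens else 0.0

def vgd_step(image, beams, K=10, alpha=0.67):
    """One token-by-token beam-search step that balances LM fluency
    against visual (CLIP) relevance; top-K prefixes survive."""
    candidates = []
    for prefix, score in beams:
        for tok in VOCAB:
            new_prefix = prefix + [tok]
            # Assumed combination rule: convex mix weighted by alpha.
            s = (1 - alpha) * lm_log_prob(prefix, tok) \
                + alpha * clip_score(image, new_prefix)
            candidates.append((new_prefix, score + s))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:K]

beams = [([], 0.0)]
for _ in range(3):  # invert a 3-token prompt for the toy "cat" image
    beams = vgd_step("cat_image", beams, K=10, alpha=0.67)
best_prompt, best_score = beams[0]
```

With α pulling candidates toward visually relevant tokens, the surviving prefixes here quickly include "cat"; setting α = 0 would reduce the loop to plain LM beam search.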
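The evaluation protocol quoted for the dataset rows (randomly sample 100 images per dataset, repeat across 5 random seeds) can be reproduced deterministically with seeded sampling. This is a generic sketch under assumed toy image IDs; the real IDs would come from LAION400M, MS COCO, Celeb-A, or Lexica.art.

```python
import random

def sample_eval_set(dataset_ids, n=100, seed=0):
    """Reproducibly sample n images from a dataset for one run.

    Seeding a private Random instance makes each run repeatable
    without disturbing global random state.
    """
    rng = random.Random(seed)
    return rng.sample(dataset_ids, n)

# Toy stand-in for a dataset's image identifiers (assumption).
ids = [f"img_{i:04d}" for i in range(1000)]

# 5 evaluation runs with different seeds, as in the quoted protocol.
runs = [sample_eval_set(ids, n=100, seed=s) for s in range(5)]
```

Averaging metrics over the 5 runs then gives the per-dataset numbers the protocol calls for.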