Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Authors: Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models. In our experiments, we show that VGD qualitatively and quantitatively achieves state-of-the-art (SOTA) performance, demonstrating superior interpretability, generalizability, and flexibility in text-to-image generation compared to previous soft and hard prompt inversion methods. We also show that VGD is compatible with a combination of various LLMs (i.e., LLaMA2, LLaMA3, Mistral) and image generation models (i.e., DALL-E 2, Midjourney, Stable Diffusion 2). |
| Researcher Affiliation | Academia | Donghoon Kim1, Minji Bae1, Kyuhong Shim2, Byonghyo Shim1; 1Seoul National University, 2Sungkyunkwan University. EMAIL; EMAIL |
| Pseudocode | No | The paper describes the methodology, including problem formulation, approximation with CLIP score, and token-by-token generation, using descriptive text and mathematical equations. It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present the steps in a structured, code-like format. |
| Open Source Code | No | The paper does not contain an unambiguous statement from the authors that they are releasing their code for the methodology described. It references a third-party tool's GitHub link ('1https://github.com/pharmapsychotic/clip-interrogator'), but this is not for their own implementation. |
| Open Datasets | Yes | Datasets We conduct experiments on four datasets with diverse distributions: LAION400M (Schuhmann et al., 2021; 2022), MS COCO (Lin et al., 2014), Celeb-A (Liu et al., 2015), and Lexica.art (https://huggingface.co/datasets/Gustavosta/Stable-Diffusion-Prompts). Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details. |
| Dataset Splits | Yes | Following PEZ (Wen et al., 2024), we randomly sample 100 images from each dataset and evaluate prompt inversion methods across 5 runs using different random seeds. See Appendix A.3 for more details. |
| Hardware Specification | Yes | We further investigate the efficiency of VGD in comparison with other baseline methods, measured on a single A100 80GB GPU. |
| Software Dependencies | Yes | For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for the text-to-image model. Images are generated with the Stable Diffusion 2.1-768 model across all comparisons (Podell et al., 2024). |
| Experiment Setup | Yes | The beam width K is set to 10. The balancing hyperparameter α is set to 0.67. For VGD generation, we used laion/CLIP-ViT-H-14-laion2B-s32B-b79K. For CLIP-I score evaluation, we used laion/CLIP-ViT-g-14-laion2B-s12B-b42K. We used Stable Diffusion stabilityai/stable-diffusion-2-1 for the text-to-image model. |
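The setup rows above describe a token-by-token generation scheme: a beam search of width K = 10 in which candidate tokens are ranked by a combination of the language model's probability and a CLIP image-text score, balanced by α = 0.67. Since the paper releases no code, the sketch below is only a minimal, hypothetical illustration of that decoding loop; `lm_logprob` and `clip_score` are stand-in callables for the real LLM and CLIP models, and the exact way VGD combines the two terms may differ.

```python
def beam_search_decode(lm_logprob, clip_score, vocab, K=10, alpha=0.67, max_len=16):
    """Gradient-free, token-by-token beam decoding (illustrative sketch).

    At each step, every beam prefix is extended by every vocabulary token
    and scored by a weighted sum of the LM log-probability and a CLIP-style
    image-text score; the top-K candidates survive to the next step.

    lm_logprob(prefix, tok) -> float : stand-in for the LLM's log p(tok | prefix)
    clip_score(tokens)      -> float : stand-in for CLIP similarity of the
                                       candidate prompt to the target image
    """
    beams = [([], 0.0)]  # (token sequence, cumulative score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in vocab:
                step = (1 - alpha) * lm_logprob(prefix, tok) \
                       + alpha * clip_score(prefix + [tok])
                candidates.append((prefix + [tok], score + step))
        # Keep only the K highest-scoring candidate prompts.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beams[0][0]  # best prompt found


# Toy usage with dummy scorers (no real models involved):
vocab = ["a", "b", "c"]
lm = lambda prefix, tok: {"a": -1.0, "b": -2.0, "c": -3.0}[tok]
clip = lambda seq: 1.0 if seq[-1] == "b" else 0.0
print(beam_search_decode(lm, clip, vocab, K=2, alpha=0.67, max_len=3))
```

With these toy scorers, the CLIP term (weight α = 0.67) outweighs the LM preference for "a", so the decoder selects "b" at every step; lowering α would flip that trade-off toward fluency over image alignment.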