Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision

Authors: Weicai Yan, Wang Lin, Zirun Guo, Ye Wang, Fangming Feng, Xiaoda Yang, Zehan Wang, Tao Jin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation.
Researcher Affiliation | Academia | Weicai Yan, Wang Lin, Zirun Guo, Ye Wang, Fangming Feng, Xiaoda Yang, Zehan Wang, Tao Jin — Zhejiang University
Pseudocode | No | The paper describes the methodology in text and through architectural diagrams, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/Kelvin-ywc/diff-prompt.
Open Datasets | Yes | We conducted experiments on two vision-language understanding datasets, RefCOCO (Kazemzadeh et al., 2014) and Flickr30k (Plummer et al., 2016). ... We selected 11 representative datasets from ODinW (Li et al., 2022b) for testing: American Sign Language Letters (Letters), BCCD, Brackish Underwater (Underwater), Cottontail Rabbits (Rabbits), North America Mushrooms (Mushrooms), Packages, Pistols, Raccoon, Shellfish Open Images (Shellfish), Thermal Dogs And People (Dogs People), and Vehicles Open Images (Vehicles). ... We select ImageNet (Deng et al., 2009) and its four variations: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a) and ImageNet-S (Wang et al., 2019), as the evaluation datasets.
Dataset Splits | Yes | Dataset. We conducted experiments on two vision-language understanding datasets, RefCOCO (Kazemzadeh et al., 2014) and Flickr30k (Plummer et al., 2016). RefCOCO includes a training set, two test sets (testA and testB), and a validation set (val). TestA contains multiple people, while testB contains multiple non-human objects. The Flickr30k dataset includes the train, test, and val sets.
Hardware Specification | No | The paper discusses computational complexity in GFLOPs and inference time, but it does not specify any particular GPU or CPU models, or other hardware components used for running the experiments.
Software Dependencies | No | We use the AutoencoderKL class from the Python diffusers library to train our Mask-VAE, setting the in_channels parameter to 1. The paper names a specific library, the Python diffusers library, but does not provide its version number or any other software dependencies with pinned versions.
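For concreteness, the Mask-VAE configuration quoted above can be sketched as below using diffusers' AutoencoderKL. This is an illustrative assumption, not the authors' code: only in_channels=1 is stated in the paper; out_channels and latent_channels are hypothetical choices.

```python
# Sketch of the Mask-VAE setup described in the quote, assuming the
# diffusers AutoencoderKL API. Only in_channels=1 comes from the paper.
from diffusers import AutoencoderKL

mask_vae = AutoencoderKL(
    in_channels=1,      # single-channel masks (stated in the paper)
    out_channels=1,     # reconstruct a single-channel mask (assumption)
    latent_channels=4,  # diffusers default; not specified in the paper
)
```

Because the paper pins no versions, this snippet may need adjustment across diffusers releases.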
Experiment Setup | Yes | Experiment Detail. For Diff-Prompt, in the first stage, we train Mask-VAE on the RefCOCO dataset for 200 epochs, setting the batch size to 128, the learning rate to 0.05, and λ to 0.0003. In the second stage, we train the prompt generator. During the training phase, we set T_forward = 100 and use squaredcos_cap_v2 as the noise scheduler. In the sampling phase, we use DDIM and set the number of sampling timesteps T_sample to 25, with the batch size set to 128 and the number of epochs to 100. In the third stage, for the input of the ith attention layer, we select the latent features at step 25 2i as the generated prompts. The visual embedding size is set to 96, and the language embedding size is set to 768. The learning rate is set to 0.0001, and AdamW is used as the optimizer.
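The diffusion settings above (T_forward = 100 training steps with a squaredcos_cap_v2 noise schedule, DDIM sampling with T_sample = 25) can be sketched in plain Python. This is a minimal sketch of the standard cosine ("squaredcos_cap_v2") beta schedule and DDIM's evenly spaced timestep subsampling as implemented in common diffusion libraries; the constants 0.008, 1.008, and the 0.999 cap are assumptions from that convention, not values stated in the paper.

```python
import math

def squaredcos_cap_v2_betas(num_train_timesteps=100, max_beta=0.999):
    """Cosine noise schedule: beta_t = 1 - alpha_bar(t+1)/alpha_bar(t),
    capped at max_beta (the usual 'squaredcos_cap_v2' construction)."""
    def alpha_bar(t):
        # Cumulative signal retention at fractional time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

def ddim_timesteps(num_train_timesteps=100, num_inference_steps=25):
    """Evenly spaced subset of training timesteps, visited in reverse
    during DDIM sampling (here: 96, 92, ..., 4, 0)."""
    step = num_train_timesteps // num_inference_steps
    return list(range(0, num_train_timesteps, step))[::-1]

betas = squaredcos_cap_v2_betas()   # 100 betas for the forward process
steps = ddim_timesteps()            # 25 DDIM sampling timesteps
```

Subsampling 25 of the 100 training timesteps is what lets DDIM trade a small quality loss for a 4x reduction in sampling cost.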