Exploring Task-Level Optimal Prompts for Visual In-Context Learning

Authors: Yan Zhu, Huan Ma, Changqing Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance with a minimal cost that prior work has never achieved. To evaluate the effectiveness of our strategies, we conduct extensive experiments on various downstream tasks, such as foreground segmentation, single object detection, and colorization.
Researcher Affiliation | Academia | Yan Zhu1*, Huan Ma1*, Changqing Zhang1; 1College of Intelligence and Computing, Tianjin University, Tianjin, China. *These authors contributed equally. Corresponding to EMAIL.
Pseudocode | Yes | Algorithm 1: Top-K Prompt Selection Method; Algorithm 2: Greedy Prompt Selection Method.
Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | Foreground Segmentation: we use the Pascal-5^i (Shaban et al. 2017) dataset, which is comprised of 4 different image splits, where each split contains data from 5 categories. Single Object Detection: we use the Pascal VOC 2012 (Everingham et al. 2015) dataset, which contains images and their associated detection boxes. Colorization: we use a subset of the ImageNet (Russakovsky et al. 2015) dataset, which contains data from 1000 categories.
Dataset Splits | Yes | In each task, the few-shot in-context examples come from the training set, with a default size of N = 16. Random: randomly select prompt combinations from the few-shot in-context examples. (The result of the random method is obtained by averaging the performance of all possible few-shot in-context example combinations on the test set.)1 Footnote 1: since exhaustive evaluation is O(2^N) when N = 16, we randomly select N' = 6 samples for obtaining P and use the remaining N - N' = 10 samples for validation-set performance evaluation in the Random, Oracle, and our methods.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | In each task, the few-shot in-context examples come from the training set, with a default size of N = 16. For all experiments, we perform evaluations on a pre-trained image inpainting model, MAE-VQGAN (Bar et al. 2022), which consists of an encoder and a decoder. When the prompt has K samples, each sample is combined with the query image into a grid of 2x2 sub-images. These are then passed through the encoder to obtain K features, which are summed to get fused features and then input to the decoder for the final output. We conduct experiments on different settings and report the results of three runs (with seed = {0, 1, 2}).
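The prompting pipeline quoted in the Experiment Setup row (each example paired with the query in a 2x2 grid, K encoder features summed, then decoded) can be sketched as below. This is an illustrative sketch only: the `encoder`/`decoder` callables, array shapes, and the zero-filled masked cell are assumptions, not the paper's actual MAE-VQGAN implementation.

```python
import numpy as np

def vicl_forward(encoder, decoder, examples, query):
    """Fuse K in-context examples with a query image.

    For each (input, output) example, build a 2x2 canvas:
    top row = example input | example output,
    bottom row = query | masked (zero) cell to be inpainted.
    Encode each canvas, sum the K features, and decode the result.
    """
    feats = []
    for ex_in, ex_out in examples:
        top = np.concatenate([ex_in, ex_out], axis=1)
        bottom = np.concatenate([query, np.zeros_like(query)], axis=1)
        grid = np.concatenate([top, bottom], axis=0)  # 2x2 sub-image canvas
        feats.append(encoder(grid))
    fused = np.sum(feats, axis=0)  # sum the K features into one
    return decoder(fused)
```

With toy stand-ins (e.g. `encoder=np.mean`, `decoder=lambda f: f`) the function runs end to end, which makes the grid-then-sum-then-decode data flow easy to inspect.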
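The report names Algorithm 2 (Greedy Prompt Selection) but does not reproduce it. As a minimal sketch of the standard greedy subset-selection loop that the algorithm's name suggests: at each step, add the candidate example that most improves the score of the current prompt set. The `score_fn` (e.g. validation-set performance of a candidate prompt set) and the budget `k` are placeholders, not details taken from the paper.

```python
def greedy_prompt_selection(candidates, score_fn, k):
    """Greedily grow a prompt set of size k.

    At each step, evaluate every remaining candidate appended to the
    current selection and keep the one with the highest score.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best, best_score = None, float("-inf")
        for cand in remaining:
            score = score_fn(selected + [cand])
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Compared with exhaustively scoring all O(2^N) subsets (the cost the footnote in the Dataset Splits row alludes to), this loop needs only O(N * k) score evaluations.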