Exploring Task-Level Optimal Prompts for Visual In-Context Learning
Authors: Yan Zhu, Huan Ma, Changqing Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance with a minimal cost that prior work has never achieved. To evaluate the effectiveness of our strategies, we conduct extensive experiments on various downstream tasks, such as foreground segmentation, single object detection, and colorization. |
| Researcher Affiliation | Academia | Yan Zhu*, Huan Ma*, Changqing Zhang; College of Intelligence and Computing, Tianjin University, Tianjin, China. *These authors contributed equally. Corresponding to EMAIL. |
| Pseudocode | Yes | Algorithm 1: Top-K Prompt Selection Method; Algorithm 2: Greedy Prompt Selection Method |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | Foreground Segmentation: We use the Pascal-5^i (Shaban et al. 2017) dataset, which is comprised of 4 different image splits, where each split contains data from 5 categories. Single Object Detection: We use the Pascal VOC 2012 (Everingham et al. 2015) dataset, which contains images and their associated detection boxes. Colorization: We use a subset of the ImageNet (Russakovsky et al. 2015) dataset, which contains data from 1000 categories. |
| Dataset Splits | Yes | In each task, the few-shot in-context examples come from the training set, with a default size of N = 16. Random: randomly select prompt combinations from the few-shot in-context examples. (The result of the random method is obtained by averaging the performance of all possible few-shot in-context example combinations on the test set.) Footnote: since exhaustive evaluation is O(2^N), when N = 16 we randomly select N′ = 6 samples for obtaining P and use the remaining N − N′ = 10 samples for validation-set performance evaluation in the random, Oracle, and our methods. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In each task, the few-shot in-context examples come from the training set, with a default size of N = 16. For all experiments, we perform evaluations on a pre-trained image inpainting model, MAE-VQGAN (Bar et al. 2022), which consists of an encoder and a decoder. When the prompt has K samples, each sample is combined with the query image into a grid of 2 × 2 sub-images. These are then passed through the encoder to obtain K features, which are summed to get fused features and then input to the decoder for the final output. We conduct experiments on different settings and report the results of three runs (with seed = {0, 1, 2}). |
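The paper's Algorithm 2 is described only by its title ("Greedy Prompt Selection Method"), but a standard greedy selection loop over the few-shot pool can be sketched as follows. This is a generic sketch, not the paper's exact procedure: `score_fn` stands in for whatever validation-set metric the authors maximize, and all names here are illustrative.

```python
def greedy_prompt_selection(candidates, score_fn, k):
    """Greedily build a K-sample prompt from a pool of in-context examples.

    Sketch only: candidates is the few-shot pool, score_fn(prompt) returns a
    validation score for a candidate prompt (higher is better), and k is the
    target prompt size. Each round adds the example that most improves the
    score of the combined prompt.
    """
    prompt = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: score_fn(prompt + [c]))
        prompt.append(best)
        remaining.remove(best)
    return prompt
```

Note that greedy selection evaluates only O(N·K) prompt combinations, versus the O(2^N) exhaustive search the paper's footnote identifies as intractable for N = 16.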
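The inference pipeline quoted above (each prompt sample paired with the query into a 2 × 2 grid, encoded, then the K features summed before decoding) can be sketched in a few lines. This is a shape-level illustration of the fusion step only; `encoder`, `decoder`, and the grid layout are placeholders, not the MAE-VQGAN API.

```python
import numpy as np

def fuse_and_decode(encoder, decoder, prompt_pairs, query):
    """Sketch of the K-sample prompt fusion described in the review.

    prompt_pairs: list of K (input_image, label_image) arrays.
    query: query image array, same shape as each prompt image.
    Each pair is laid out with the query into a 2x2 grid (the bottom-right
    cell is left blank for inpainting), encoded separately, and the K
    encoder features are summed before a single decoder pass.
    """
    feats = []
    for inp, tgt in prompt_pairs:
        top = np.concatenate([inp, tgt], axis=1)                  # example row
        bottom = np.concatenate([query, np.zeros_like(query)], axis=1)  # query + masked cell
        grid = np.concatenate([top, bottom], axis=0)              # 2x2 canvas
        feats.append(encoder(grid))
    fused = np.sum(feats, axis=0)                                 # sum K features
    return decoder(fused)
```

Summing (rather than concatenating) the K features keeps the decoder input size fixed regardless of prompt length, which is presumably why the prompt can vary in K without changing the model.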