Exploring Task-Level Optimal Prompts for Visual In-Context Learning

Authors: Yan Zhu, Huan Ma, Changqing Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance with a minimal cost that prior work has never achieved. To evaluate the effectiveness of our strategies, we conduct extensive experiments on various downstream tasks, such as foreground segmentation, single object detection, and colorization.
Researcher Affiliation | Academia | Yan Zhu1*, Huan Ma1*, Changqing Zhang1; 1College of Intelligence and Computing, Tianjin University, Tianjin, China. *These authors contributed equally. Corresponding to EMAIL.
Pseudocode | Yes | Algorithm 1: Top-K Prompt Selection Method; Algorithm 2: Greedy Prompt Selection Method.
Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | Foreground Segmentation: we use the Pascal-5^i (Shaban et al. 2017) dataset, which is comprised of 4 different image splits, where each split contains data from 5 categories. Single Object Detection: we use the Pascal VOC 2012 (Everingham et al. 2015) dataset, which contains images and their associated detection boxes. Colorization: we use a subset of the ImageNet (Russakovsky et al. 2015) dataset, which contains data from 1000 categories.
Dataset Splits | Yes | In each task, the few-shot in-context examples come from the training set, with a default size of N = 16. Random: randomly select prompt combinations from the few-shot in-context examples. (The result of the random method is obtained by averaging the performance of all possible few-shot in-context example combinations on the test set.)1 Footnote 1: since exhaustive evaluation is O(2^N) when N = 16, we randomly select N' = 6 samples for obtaining P and use the remaining N - N' = 10 samples for validation-set performance evaluation in the Random, Oracle, and our methods.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | In each task, the few-shot in-context examples come from the training set, with a default size of N = 16. For all experiments, we perform evaluations on a pre-trained image inpainting model, MAE-VQGAN (Bar et al. 2022), which consists of an encoder and a decoder. When the prompt has K samples, each sample is combined with the query image into a grid of 2x2 sub-images. These are then passed through the encoder to obtain K features, which are summed to get fused features and then input to the decoder for the final output. We conduct experiments on different settings and report the results of three runs (with seed = {0, 1, 2}).
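The prompting pipeline quoted in the Experiment Setup row (each example paired with the query in a 2x2 grid, K encoder features summed, then decoded) can be sketched as below. This is an illustrative sketch only: the `encoder`/`decoder` callables, array shapes, and the zero-filled masked cell are assumptions, not the paper's actual MAE-VQGAN implementation.

```python
import numpy as np

def vicl_forward(encoder, decoder, examples, query):
    """Fuse K in-context examples with a query image.

    For each (input, output) example, build a 2x2 canvas:
    top row = example input | example output,
    bottom row = query | masked (zero) cell to be inpainted.
    Encode each canvas, sum the K features, and decode the result.
    """
    feats = []
    for ex_in, ex_out in examples:
        top = np.concatenate([ex_in, ex_out], axis=1)
        bottom = np.concatenate([query, np.zeros_like(query)], axis=1)
        grid = np.concatenate([top, bottom], axis=0)  # 2x2 sub-image canvas
        feats.append(encoder(grid))
    fused = np.sum(feats, axis=0)  # sum the K features into one
    return decoder(fused)
```

With toy stand-ins (e.g. `encoder=np.mean`, `decoder=lambda f: f`) the function runs end to end, which makes the grid-then-sum-then-decode data flow easy to inspect.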
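The report names Algorithm 2 (Greedy Prompt Selection) but does not reproduce it. As a minimal sketch of the standard greedy subset-selection loop that the algorithm's name suggests: at each step, add the candidate example that most improves the score of the current prompt set. The `score_fn` (e.g. validation-set performance of a candidate prompt set) and the budget `k` are placeholders, not details taken from the paper.

```python
def greedy_prompt_selection(candidates, score_fn, k):
    """Greedily grow a prompt set of size k.

    At each step, evaluate every remaining candidate appended to the
    current selection and keep the one with the highest score.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best, best_score = None, float("-inf")
        for cand in remaining:
            score = score_fn(selected + [cand])
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Compared with exhaustively scoring all O(2^N) subsets (the cost the footnote in the Dataset Splits row alludes to), this loop needs only O(N * k) score evaluations.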