VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Authors: Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we introduce a comprehensive benchmark for multimodal in-context learning. Our VL-ICL Bench encompasses a broad spectrum of tasks that involve both images and text as inputs and outputs, and different types of challenges, from perception to reasoning and long context length. We evaluate the abilities of state-of-the-art VLLMs on this benchmark suite, revealing their diverse strengths and weaknesses, and showing that even the most advanced models, such as GPT-4, find the tasks challenging. By highlighting a range of new ICL tasks, and the associated strengths and limitations of existing models, we hope that our dataset will inspire future work on enhancing the in-context learning capabilities of VLLMs, as well as inspire new applications that leverage VLLM ICL.
Researcher Affiliation | Academia | Yongshuo Zong*, Ondrej Bohdal*, Timothy Hospedales, University of Edinburgh (* Co-first authors). EMAIL
Pseudocode | No | The paper describes methods and a prompt format but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | No | The paper provides a 'Project page: https://ys-zong.github.io/VL-ICL/.' However, this is a project overview page and is not explicitly stated to be a source code repository, nor does the paper contain an unambiguous statement of code release for the methodology described.
Open Datasets | Yes | In this study, we introduce a comprehensive benchmark for multimodal in-context learning. Our VL-ICL Bench encompasses a broad spectrum of tasks... Project page: https://ys-zong.github.io/VL-ICL/. Fast Open MiniImageNet: We use the variant of MiniImageNet few-shot object recognition (Vinyals et al., 2016) repurposed for ICL in Tsimpoukelli et al. (2021). CLEVR Count Induction: In this dataset... We input CLEVR scene images (Johnson et al., 2017). TextOCR: We repurpose the TextOCR dataset (Singh et al., 2021). CoBSAT: We also utilize a recent text-to-image CoBSAT (Zeng et al., 2024) benchmark as part of our larger VL-ICL Bench suite.
Dataset Splits | Yes | We follow the typical protocol of the ICL community (Dong et al., 2024; Tsimpoukelli et al., 2021; Min et al., 2022) and split each dataset into train and test splits. Few-shot ICL is then performed/evaluated by sampling the support/context set from the training split, and the test/query examples from the testing split. The final performance is the average of a number of such ICL episodes. Table 1 summarises the diverse capabilities tested by each VL-ICL Bench task (table columns include Train Set, Test Set, Size (GB)). Appendix A: CLEVR: we randomly sample 800 examples from the original CLEVR dataset training scenes to use as support examples, and we select 200 examples randomly from the validation split as query examples. Appendix A: Operator Induction: we generate this dataset ourselves and we consider all single-digit combinations. We use randomly selected 80 combinations of digits as the support set and 20 combinations of digits as the query set.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions model sizes and context lengths but not the underlying physical hardware.
Software Dependencies | No | The paper does not list specific software components with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | All experiments are conducted using three different random seeds and we report the average performance. We use officially released model weights or the GPT-4 API and adopt greedy decoding for reproducibility. Prompt: For consistency, we employ the following standard prompt format for in-context learning. [Task Description] Support Set: [Image][Question][Answer] (n-shot) Query: [Image][Question] Prediction: [Answer]. Evaluation Metrics: All our experiments evaluate test accuracy as a function of the number of shots... We use three main metrics: zero-shot accuracy, peak (max.) accuracy over all shots, and ICL efficiency... For text-to-image models, we employ LLaVA-Next-7B as the judge model to determine whether the generated images are correct.
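The episode-sampling protocol described in the Dataset Splits row (support set drawn from the train split, query from the test split, accuracy averaged over episodes) can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the `question`/`answer` dict fields are hypothetical.

```python
import random

def sample_icl_episode(train_set, test_set, n_shot, rng):
    """One ICL episode: an n-shot support set from the train split
    and a single query example from the test split."""
    support = rng.sample(train_set, n_shot)
    query = rng.choice(test_set)
    return support, query

def evaluate_icl(train_set, test_set, n_shot, predict_fn, n_episodes=200, seed=0):
    """Average accuracy over a number of ICL episodes, per the
    protocol quoted from the paper (hypothetical helper)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_episodes):
        support, query = sample_icl_episode(train_set, test_set, n_shot, rng)
        if predict_fn(support, query) == query["answer"]:
            correct += 1
    return correct / n_episodes
```

The fixed seed mirrors the report's note that three seeds were used and results averaged; repeating `evaluate_icl` with different seeds and averaging reproduces that outer loop.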
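The standard prompt format quoted in the Experiment Setup row can be approximated by a small helper. This is a sketch under assumptions: the function name, the `[IMAGE]` placeholder token, and the dict fields are illustrative; real VLLM APIs interleave actual image data at the placeholder positions rather than a text token.

```python
def build_icl_prompt(task_description, support, query):
    """Assemble the [Task Description] / Support Set / Query / Prediction
    layout quoted above (hypothetical helper, text-only approximation)."""
    lines = [task_description, "Support Set:"]
    for ex in support:  # n-shot in-context examples
        lines.append(f"[IMAGE] {ex['question']} {ex['answer']}")
    lines.append("Query:")
    lines.append(f"[IMAGE] {query['question']}")
    lines.append("Prediction:")  # the model completes the answer here
    return "\n".join(lines)
```

With greedy decoding (as the paper adopts for reproducibility), the same prompt string yields the same completion across runs for a fixed model.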