Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David Chan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window.
Researcher Affiliation | Academia | University of California, Berkeley
Pseudocode | No | The paper describes methods and architectural components like MIRAGE (Figure 5) and its training procedure, but does not present any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | Our dataset, model, and code are available at: https://visual-haystacks.github.io. ... Code for MIRAGE is made publicly available under the MIT license at https://github.com/visual-haystacks/mirage, ... Code for the VHs benchmark is made publicly available under the MIT license at https://github.com/visual-haystacks/vhs_benchmark
Open Datasets | Yes | Our dataset, model, and code are available at: https://visual-haystacks.github.io. ... We construct the VHs dataset from the COCO dataset (Lin et al., 2014) ... We first included all publicly available MIQA training sets, including RetVQA (Penamakuri et al., 2023), SlideVQA (Tanaka et al., 2023), and WebQA (Chang et al., 2022).
Dataset Splits | No | VHs consists of 1000 question-answer pairs for both single- and multi-needle settings, with an explicit small subset VHs-small consisting of 100 questions... We conducted experiments using the full VHs dataset where the haystack size was 100 images or fewer, and switched to the VHs-small subset with larger haystacks to mitigate computational costs.
Hardware Specification | Yes | processes up to 10k images on a single 40GB A100 GPU ... Phi-3 theoretically offers a higher context capacity, it exhausted the memory of four 40GB A100 GPUs when processing 100 images. ... The instruction tuning was completed in two days using 16 A100 GPUs
Software Dependencies | No | The paper mentions specific LMM models and components used (e.g., LLaVA-v1.5-7B, Llama-v3.1-8B, Q-Former, CLIP, OWLv2) and provides Hugging Face links for some open-source models, but it does not specify versions for general ancillary software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup | Yes | The instruction tuning was completed in two days using 16 A100 GPUs, with the first 60% of the training focused on passing only relevant images to the LLM. In the remaining 40%, several distractor images were added to improve robustness, following recommendations from (Zhang et al., 2024). ... we co-trained the retriever using the binary cross-entropy loss, assigning a higher weight (5.0) to positive samples to address data imbalance and prioritize recall.
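The weighted retriever loss quoted above can be sketched in PyTorch. This is a hypothetical illustration, not the paper's released code: only the positive-sample weight of 5.0 comes from the paper; the function name, tensor shapes, and toy inputs are assumptions.

```python
import torch
import torch.nn.functional as F

# From the paper: positive samples get weight 5.0 to counter data
# imbalance and prioritize recall on relevant images.
POS_WEIGHT = 5.0

def retrieval_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over per-image relevance scores.

    logits, labels: shape (num_images,); label 1.0 marks a relevant image.
    """
    return F.binary_cross_entropy_with_logits(
        logits, labels, pos_weight=torch.tensor(POS_WEIGHT)
    )

# Toy haystack: 4 candidate images, only the first is relevant.
logits = torch.tensor([2.0, -1.0, -0.5, -2.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
loss = retrieval_loss(logits, labels)
```

Because `pos_weight` multiplies only the positive-label term, missing a relevant image costs roughly five times more than a false positive, which is what pushes the retriever toward high recall.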