Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David Chan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. |
| Researcher Affiliation | Academia | University of California, Berkeley |
| Pseudocode | No | The paper describes methods and architectural components like MIRAGE (Figure 5) and its training procedure, but does not present any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our dataset, model, and code are available at: https://visual-haystacks.github.io. ... Code for MIRAGE is made publicly available under the MIT license at https://github.com/visual-haystacks/mirage, ... Code for the VHs benchmark is made publicly available under the MIT license at https://github.com/visual-haystacks/vhs_benchmark |
| Open Datasets | Yes | Our dataset, model, and code are available at: https://visual-haystacks.github.io. ... We construct the VHs dataset from the COCO dataset (Lin et al., 2014) ... We first included all publicly available MIQA training sets, including RetVQA (Penamakuri et al., 2023), SlideVQA (Tanaka et al., 2023), and WebQA (Chang et al., 2022). |
| Dataset Splits | No | VHs consists of 1000 question-answer pairs for both single- and multi-needle settings, with an explicit small subset VHs-small consisting of 100 questions... We conducted experiments using the full VHs dataset where the haystack size was 100 images or fewer, and switched to the VHs-small subset with larger haystacks to mitigate computational costs. |
| Hardware Specification | Yes | processes up to 10k images on a single 40G A100 GPU ... Phi-3 theoretically offers a higher context capacity, it exhausted the memory of four 40GB A100 GPUs when processing 100 images. ... The instruction tuning was completed in two days using 16 A100 GPUs |
| Software Dependencies | No | The paper mentions specific LMM models and components used (e.g., LLaVA-v1.5-7B, Llama-v3.1-8B, Q-Former, CLIP, OWLv2) and provides Hugging Face links for some open-source models, but it does not specify versions for general ancillary software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | The instruction tuning was completed in two days using 16 A100 GPUs, with the first 60% of the training focused on passing only relevant images to the LLM. In the remaining 40%, several distractor images were added to improve robustness, following recommendations from (Zhang et al., 2024). ... we co-trained the retriever using the binary cross-entropy loss, assigning a higher weight (5.0) to positive samples to address data imbalance and prioritize recall. |
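The retriever co-training detail above — binary cross-entropy with positive samples weighted 5.0 to counter class imbalance — can be illustrated with a minimal sketch. This is plain Python, not the authors' implementation; the function name and inputs are illustrative, and only the positive-class weight of 5.0 comes from the paper.

```python
import math

def weighted_bce(probs, labels, pos_weight=5.0):
    """Binary cross-entropy that up-weights positive (relevant-image) samples.

    probs:  predicted relevance probabilities in (0, 1)
    labels: ground-truth binary labels (1 = relevant image, 0 = distractor)
    pos_weight: extra weight on positives (5.0 per the paper, to prioritize recall)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp predictions to avoid log(0).
        p = min(max(p, 1e-7), 1.0 - 1e-7)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

Because missed positives are penalized five times as heavily as false alarms, the retriever is pushed toward high recall, which matches the stated goal of not dropping the needle image before it reaches the LLM.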