Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David Chan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. |
| Researcher Affiliation | Academia | University of California, Berkeley |
| Pseudocode | No | The paper describes methods and architectural components like MIRAGE (Figure 5) and its training procedure, but does not present any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our dataset, model, and code are available at: https://visual-haystacks.github.io. ... Code for MIRAGE is made publicly available under the MIT license at https://github.com/visual-haystacks/mirage, ... Code for the VHs benchmark is made publicly available under the MIT license at https://github.com/visual-haystacks/vhs_benchmark |
| Open Datasets | Yes | Our dataset, model, and code are available at: https://visual-haystacks.github.io. ... We construct the VHs dataset from the COCO dataset (Lin et al., 2014) ... We first included all publicly available MIQA training sets, including RetVQA (Penamakuri et al., 2023), SlideVQA (Tanaka et al., 2023), and WebQA (Chang et al., 2022). |
| Dataset Splits | No | VHs consists of 1000 question-answer pairs for both single- and multi-needle settings, with an explicit small subset VHs-small consisting of 100 questions... We conducted experiments using the full VHs dataset where the haystack size was 100 images or fewer, and switched to the VHs-small subset with larger haystacks to mitigate computational costs. |
| Hardware Specification | Yes | processes up to 10k images on a single 40G A100 GPU ... Phi-3 theoretically offers a higher context capacity, it exhausted the memory of four 40GB A100 GPUs when processing 100 images. ... The instruction tuning was completed in two days using 16 A100 GPUs |
| Software Dependencies | No | The paper mentions specific LMM models and components used (e.g., LLaVA-v1.5-7B, Llama-v3.1-8B, Q-Former, CLIP, OWLv2) and provides Hugging Face links for some open-source models, but it does not specify versions for general ancillary software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | The instruction tuning was completed in two days using 16 A100 GPUs, with the first 60% of the training focused on passing only relevant images to the LLM. In the remaining 40%, several distractor images were added to improve robustness, following recommendations from (Zhang et al., 2024). ... we co-trained the retriever using the binary cross-entropy loss, assigning a higher weight (5.0) to positive samples to address data imbalance and prioritize recall. |
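The retriever co-training detail above — binary cross-entropy with positive samples weighted 5.0 to counter class imbalance — can be illustrated with a minimal sketch. This is plain Python, not the authors' implementation; the function name and inputs are illustrative, and only the positive-class weight of 5.0 comes from the paper.

```python
import math

def weighted_bce(probs, labels, pos_weight=5.0):
    """Binary cross-entropy that up-weights positive (relevant-image) samples.

    probs:  predicted relevance probabilities in (0, 1)
    labels: ground-truth binary labels (1 = relevant image, 0 = distractor)
    pos_weight: extra weight on positives (5.0 per the paper, to prioritize recall)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp predictions to avoid log(0).
        p = min(max(p, 1e-7), 1.0 - 1e-7)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

Because missed positives are penalized five times as heavily as false alarms, the retriever is pushed toward high recall, which matches the stated goal of not dropping the needle image before it reaches the LLM.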