Probing Visual Language Priors in VLMs

Authors: Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs... Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4o achieves only 66.17% on ViLP... We demonstrate its effectiveness in LLaVA-v1.5 and Cambrian. Project Page: ViLP.
Researcher Affiliation | Collaboration | 1University of Michigan, 2LG AI Research.
Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described in prose and mathematical formulations.
Open Source Code | No | The abstract mentions "Project Page: ViLP." but does not provide a direct link to a source-code repository for the methodology described in the paper, nor does it explicitly state that the code is released or available in supplementary materials.
Open Datasets | Yes | Given a seed image from COCO (Lin et al., 2014), TextVQA (Singh et al., 2019b), or Visual Genome (Krishna et al., 2017), VLMs are tasked with simultaneously selecting appropriate functions...
Dataset Splits | No | The paper mentions using "800k and 400k DPO pairs to fine-tune LLaVA (7B and 13B) and Cambrian-8B, respectively", but does not provide specific training, validation, or test splits for its experiments.
Hardware Specification | Yes | The GPUs we used are 8 L40S.
Software Dependencies | No | The paper refers to pre-trained models such as Stable Diffusion XL, InstructPix2Pix, and Grounded-SAM, but does not provide specific version numbers for these or for other software dependencies such as programming languages or deep-learning frameworks.
Experiment Setup | Yes | Batch sizes are set to 112 for LLaVA-7B, 80 for LLaVA-13B, and 8 (with 4 gradient accumulation steps) for Cambrian-8B. We employ LoRA with a rank of 128, an alpha of 256, and a learning rate of 5e-7, training each model for 2 epochs.
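The quoted setup implies a few derived quantities worth making explicit, such as the effective batch size under gradient accumulation and the LoRA scaling factor. A minimal plain-Python sketch (hyperparameter values are taken from the quote; the dictionary layout and helper-function names are our own illustration, not code from the paper):

```python
# Hyperparameters quoted in the paper's experiment setup.
SETUPS = {
    "LLaVA-7B":    {"batch_size": 112, "grad_accum": 1},
    "LLaVA-13B":   {"batch_size": 80,  "grad_accum": 1},
    "Cambrian-8B": {"batch_size": 8,   "grad_accum": 4},
}

LORA_RANK = 128      # rank r of the low-rank update matrices
LORA_ALPHA = 256     # LoRA alpha
LEARNING_RATE = 5e-7
EPOCHS = 2


def effective_batch_size(setup: dict) -> int:
    """Examples seen per optimizer step after gradient accumulation."""
    return setup["batch_size"] * setup["grad_accum"]


def lora_scaling(alpha: int, rank: int) -> float:
    """LoRA adds (alpha / rank) * B @ A to the frozen base weights,
    so alpha / rank is the effective scale of the learned update."""
    return alpha / rank


print(effective_batch_size(SETUPS["Cambrian-8B"]))  # 8 * 4 = 32
print(lora_scaling(LORA_ALPHA, LORA_RANK))          # 256 / 128 = 2.0
```

With these values, Cambrian-8B's effective batch size (32) remains well below the LLaVA batch sizes, and the alpha-to-rank ratio of 2.0 doubles the contribution of the learned low-rank update relative to a rank-equal-alpha configuration.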