Probing Visual Language Priors in VLMs
Authors: Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs... Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4o achieves only 66.17% on ViLP... We demonstrate its effectiveness in LLaVA-v1.5 and Cambrian. Project Page: ViLP. |
| Researcher Affiliation | Collaboration | ¹University of Michigan, ²LG AI Research. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described in prose and mathematical formulations. |
| Open Source Code | No | The abstract mentions "Project Page: ViLP." but does not provide a direct link to the source code repository for the methodology described in this paper, nor does it explicitly state that the code is being released or available in supplementary materials. |
| Open Datasets | Yes | Given a seed image from COCO (Lin et al., 2014), TextVQA (Singh et al., 2019b), or Visual Genome (Krishna et al., 2017), VLMs are tasked with simultaneously selecting appropriate functions... |
| Dataset Splits | No | The paper mentions training models using "800k and 400k DPO pairs to fine-tune LLaVA (7B and 13B) and Cambrian-8B, respectively", but does not provide specific training, validation, or test dataset splits for their experiments. |
| Hardware Specification | Yes | The GPUs we used are 8-L40S. |
| Software Dependencies | No | The paper refers to pre-trained models like Stable Diffusion XL, Instruct-Pix2Pix, and Grounded-SAM, but does not provide specific version numbers for these or other software dependencies such as programming languages or deep learning frameworks used for implementation. |
| Experiment Setup | Yes | Batch sizes are set to 112 for LLaVA-7B, 80 for LLaVA-13B, and 8 (with 4 gradient accumulation steps) for Cambrian-8B. We employ LoRA with a rank of 128, an alpha of 256, and a learning rate of 5e-7, training each model for 2 epochs. |
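The hyperparameters quoted in the Experiment Setup row can be collected into a plain config sketch, which makes the effective per-step batch size explicit. This is an illustrative structure only; the key names and the `effective_batch` helper are assumptions, not part of the paper.

```python
# Hedged sketch: fine-tuning hyperparameters as reported in the paper's
# experiment-setup description. Dict layout and names are illustrative.
per_model = {
    "llava-7b":    {"batch_size": 112, "grad_accum_steps": 1},
    "llava-13b":   {"batch_size": 80,  "grad_accum_steps": 1},
    "cambrian-8b": {"batch_size": 8,   "grad_accum_steps": 4},
}

shared = {
    "lora_rank": 128,    # LoRA rank reported in the paper
    "lora_alpha": 256,   # LoRA alpha reported in the paper
    "learning_rate": 5e-7,
    "epochs": 2,
}

def effective_batch(model: str) -> int:
    """Effective batch size = per-device batch * gradient accumulation steps."""
    cfg = per_model[model]
    return cfg["batch_size"] * cfg["grad_accum_steps"]
```

For example, `effective_batch("cambrian-8b")` gives 8 × 4 = 32, so Cambrian-8B's effective batch remains smaller than either LLaVA configuration.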