Locality Alignment Improves Vision-Language Models
Authors: Ian Covert, Tony Sun, James Y. Zou, Tatsunori Hashimoto
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate locality alignment with a vision-only benchmark, finding that it improves a model's performance at patch-level semantic segmentation, especially for strong backbones trained with image-caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment, and show that locality-aligned backbones improve performance across a range of benchmarks, particularly ones that involve spatial understanding (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that we can efficiently learn local semantic extraction via a locality alignment stage, and that this procedure benefits VLM training recipes that use off-the-shelf vision backbones. |
| Researcher Affiliation | Academia | Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto, Stanford University, EMAIL |
| Pseudocode | No | The paper includes equations and training diagrams (Figure 1 and Figure 2), but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide repositories to reproduce each part of our results: Locality alignment https://github.com/iancovert/locality-alignment/ Probing benchmark https://github.com/iancovert/patch-seg/ VLM training https://github.com/iancovert/prismatic-vlms/ |
| Open Datasets | Yes | We use ImageNet-1k and ImageNet-21k (hereafter IN1k and IN21k) (Deng et al., 2009) for all our experiments... We implement this approach with MSCOCO (Lin et al., 2014)... For our training dataset, we use the LLaVA-1.5 data mixture (Liu et al., 2024) that contains 665k examples, and which consists of synthetic instruction completions (Liu et al., 2023c), existing vision-language datasets (e.g., GQA, TextCaps; Hudson & Manning 2019; Sidorov et al. 2020) and a collection of language-only data (ShareGPT, 2023). |
| Dataset Splits | Yes | We use the training examples from MSCOCO with semantic segmentation masks (118k images) and report results using the validation set (5k images) (Lin et al., 2014). |
| Hardware Specification | Yes | All training runs are performed on a single NVIDIA H100 80GB GPU. All MaskEmbed runs are performed on a single node with 4 NVIDIA A100 SXM4 80GB GPUs. All VLMs are trained on a single node with 8 NVIDIA A100 SXM4 80GB GPUs. |
| Software Dependencies | No | The paper mentions 'AdamW' as the optimizer and refers to libraries like the 'Prismatic library' and the 'timm repository', but does not specify version numbers for these or other key software components (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Table 3 (probing benchmark hyperparameters): epochs 5, batch size 32, weight decay 0.01, ..., max learning rate 1e-3, min learning rate 1e-4, warmup steps 500. Table 8 (MaskEmbed hyperparameters, for ViT-T / ViT-S / ViT-B and ViT-L / ViT-SO400M): global batch size 1024 / 1024, weight decay 0.01 / 0.01, gradient clipping 1.0 / 1.0, optimizer AdamW / AdamW, β1, β2 = (0.9, 0.95) / (0.9, 0.95), ... Table 9 (VLM training hyperparameters): epochs 2, global batch size 128, max sequence length 2048, weight decay 0.1, gradient clipping 1.0, optimizer AdamW, β1, β2 = (0.9, 0.999), learning rate schedule linear warmup + cosine decay, max learning rate 2e-5, min learning rate 0, warmup ratio 0.03 |
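The VLM training schedule quoted from Table 9 (linear warmup followed by cosine decay, max learning rate 2e-5, min learning rate 0, warmup ratio 0.03) can be sketched as a small standalone function. This is a minimal illustration of the schedule shape only; the paper does not specify the total step count, so `total_steps` here is a placeholder argument, not a value from the paper:

```python
import math

# Values quoted from Table 9 of the paper.
MAX_LR = 2e-5
MIN_LR = 0.0
WARMUP_RATIO = 0.03

def lr_at_step(step: int, total_steps: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        # Linear warmup from 0 up to the max learning rate.
        return MAX_LR * step / warmup_steps
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

For example, with a hypothetical run of 10,000 steps, warmup lasts 300 steps, the learning rate peaks at 2e-5 at step 300, and decays back to 0 by the final step.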