Locality Alignment Improves Vision-Language Models
Authors: Ian Covert, Tony Sun, James Y. Zou, Tatsunori Hashimoto
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate locality alignment with a vision-only benchmark, finding that it improves a model's performance at patch-level semantic segmentation, especially for strong backbones trained with image-caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment, and show that locality-aligned backbones improve performance across a range of benchmarks, particularly ones that involve spatial understanding (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that we can efficiently learn local semantic extraction via a locality alignment stage, and that this procedure benefits VLM training recipes that use off-the-shelf vision backbones. |
| Researcher Affiliation | Academia | Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto, Stanford University, EMAIL |
| Pseudocode | No | The paper includes equations and training diagrams (Figure 1 and Figure 2), but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide repositories to reproduce each part of our results: Locality alignment https://github.com/iancovert/locality-alignment/ Probing benchmark https://github.com/iancovert/patch-seg/ VLM training https://github.com/iancovert/prismatic-vlms/ |
| Open Datasets | Yes | We use ImageNet-1k and ImageNet-21k (hereafter IN1k and IN21k) (Deng et al., 2009) for all our experiments... We implement this approach with MSCOCO (Lin et al., 2014)... For our training dataset, we use the LLaVA-1.5 data mixture (Liu et al., 2024) that contains 665k examples, and which consists of synthetic instruction completions (Liu et al., 2023c), existing vision-language datasets (e.g., GQA, TextCaps; Hudson & Manning 2019; Sidorov et al. 2020) and a collection of language-only data (ShareGPT, 2023). |
| Dataset Splits | Yes | We use the training examples from MSCOCO with semantic segmentation masks (118k images) and report results using the validation set (5k images) (Lin et al., 2014). |
| Hardware Specification | Yes | All training runs are performed on a single NVIDIA H100 80GB GPU. All MaskEmbed runs are performed on a single node with 4 NVIDIA A100 SXM4 80GB GPUs. All VLMs are trained on a single node with 8 NVIDIA A100 SXM4 80GB GPUs. |
| Software Dependencies | No | The paper mentions 'AdamW' as the optimizer and refers to libraries like the 'Prismatic library' and the 'timm repository', but does not specify version numbers for these or other key software components (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Table 3 (probing benchmark hyperparameters): epochs 5, batch size 32, weight decay 0.01, ..., max learning rate 1e-3, min learning rate 1e-4, warmup steps 500. Table 8 (MaskEmbed hyperparameters, for ViT-T / ViT-S / ViT-B and ViT-L / ViT-SO400M): global batch size 1024 / 1024, weight decay 0.01 / 0.01, gradient clipping 1.0 / 1.0, optimizer AdamW / AdamW, β1, β2 = (0.9, 0.95) / (0.9, 0.95), ... Table 9 (VLM training hyperparameters): epochs 2, global batch size 128, max sequence length 2048, weight decay 0.1, gradient clipping 1.0, optimizer AdamW, β1, β2 = (0.9, 0.999), learning rate schedule linear warmup + cosine decay, max learning rate 2e-5, min learning rate 0, warmup ratio 0.03 |
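The VLM training schedule quoted from Table 9 (linear warmup followed by cosine decay, max learning rate 2e-5, min learning rate 0, warmup ratio 0.03) can be sketched as a small standalone function. This is a minimal illustration of the schedule shape only; the paper does not specify the total step count, so `total_steps` here is a placeholder argument, not a value from the paper:

```python
import math

# Values quoted from Table 9 of the paper.
MAX_LR = 2e-5
MIN_LR = 0.0
WARMUP_RATIO = 0.03

def lr_at_step(step: int, total_steps: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        # Linear warmup from 0 up to the max learning rate.
        return MAX_LR * step / warmup_steps
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

For example, with a hypothetical run of 10,000 steps, warmup lasts 300 steps, the learning rate peaks at 2e-5 at step 300, and decays back to 0 by the final step.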