Contrastive Localized Language-Image Pre-Training

Authors: Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across 31 evaluation tasks, including standard image-text tasks, newly constructed region-text tasks, and downstream evaluations with MLLMs, we demonstrate that CLOC significantly and consistently outperforms the CLIP counterpart.
Researcher Affiliation | Industry | Work done while at Apple. Apple AI/ML. Correspondence to: Zhe Gan <EMAIL>.
Pseudocode | Yes | Concretely, VESL is a pseudo-labeling pipeline with the following steps, with pseudo codes in Appendix C:
Open Source Code | No | We are working on releasing our pre-trained checkpoints and the constructed region-text annotations along with the final version to accelerate future research.
Open Datasets | Yes | Existing region-text corpus like Visual Genome (Krishna et al., 2017) contains about 108K images, and the largest noisy-labeled grounded dataset GRIT (Peng et al., 2023) features only around 20M images.
Dataset Splits | Yes | For region retrieval, we use a validation set of the GRIT dataset (Peng et al., 2023) and encode both the image regions and the region captions. ... We randomly sampled a 2K image validation set for fast evaluation.
Hardware Specification | Yes | Our large models (ViT-L/14) were trained on 1024 v5p TPUs for about 6 days.
Software Dependencies | No | The paper mentions JAX, T5, OWLv2, and implies NLTK via Python code examples, but does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | Table A. Pre-training hyper-parameters and settings for the in-house CLIP baseline and CLOC: Batch size 32768; Image size 224×224 (ViT-B/16) or 336×336 (ViT-L/14, H/14); ... Optimizer AdamW (β1 = 0.9, β2 = 0.98); Peak learning rate (LR) 0.0005; LR schedule cosine decay with linear warm-up (first 2k steps); Weight decay 0.2; Dropout rate 0.0.
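The learning-rate settings quoted from Table A (peak LR 0.0005, linear warm-up over the first 2k steps, then cosine decay) can be sketched as a standalone schedule function. This is a minimal sketch, not the authors' code; the total step count is a hypothetical placeholder, since this excerpt does not state it.

```python
import math

PEAK_LR = 5e-4        # "Peak learning rate (LR) 0.0005" from Table A
WARMUP_STEPS = 2_000  # "linear warm-up (first 2k steps)"
TOTAL_STEPS = 200_000  # hypothetical placeholder; not given in this excerpt

def learning_rate(step: int) -> float:
    """Cosine-decay schedule with linear warm-up, as described in Table A."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 to the peak LR over the warm-up phase.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down toward 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(learning_rate(0))        # 0.0 at the start of warm-up
print(learning_rate(2_000))    # 5e-4, the peak, at the end of warm-up
print(learning_rate(200_000))  # ≈ 0.0 once fully decayed
```

In practice a JAX training loop would typically obtain an equivalent schedule from a library helper (e.g. a warm-up-plus-cosine schedule in optax) rather than hand-rolling it, but the shape of the curve is the same.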
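The region-retrieval protocol quoted in the Dataset Splits row (encode image regions and region captions, then retrieve by similarity) can be illustrated with a toy recall@1 computation. This is an illustrative sketch with stand-in embedding vectors; in the actual evaluation the embeddings would come from the CLOC image and text encoders, and the function names here (`cosine`, `recall_at_1`) are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_1(region_embs: list[list[float]],
                caption_embs: list[list[float]]) -> float:
    """Fraction of regions whose most similar caption is their paired one.

    Assumes region_embs[i] is paired with caption_embs[i].
    """
    hits = 0
    for i, region in enumerate(region_embs):
        best = max(range(len(caption_embs)),
                   key=lambda j: cosine(region, caption_embs[j]))
        hits += int(best == i)
    return hits / len(region_embs)

# Toy example: two regions whose stand-in embeddings exactly match
# their paired captions, so retrieval is perfect.
regions = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
print(recall_at_1(regions, captions))  # 1.0
```

A full evaluation over the 2K-image GRIT validation split would run this over every region crop and caption pair, typically with batched matrix products instead of the per-pair loop shown here.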