Contrastive Localized Language-Image Pre-Training
Authors: Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across 31 evaluation tasks, including standard image-text tasks, newly constructed region-text tasks, and downstream evaluations with MLLMs, we demonstrate that CLOC significantly and consistently outperforms the CLIP counterpart. |
| Researcher Affiliation | Industry | Work done while at Apple. 1Apple AI/ML. Correspondence to: Zhe Gan <EMAIL>. |
| Pseudocode | Yes | Concretely, VESL is a pseudo-labeling pipeline with the following steps, with pseudocode in Appendix C: |
| Open Source Code | No | We are working on releasing our pre-trained checkpoints and the constructed region-text annotations along with the final version to accelerate future research. |
| Open Datasets | Yes | Existing region-text corpus like Visual Genome (Krishna et al., 2017) contains about 108K images, and the largest noisy-labeled grounded dataset GRIT (Peng et al., 2023) features only around 20M images. |
| Dataset Splits | Yes | For region retrieval, we use a validation set of the GRIT dataset (Peng et al., 2023) and encode both the image regions and the region captions. ... We randomly sampled a 2K image validation set for fast evaluation. |
| Hardware Specification | Yes | Our large models (ViT-L/14) were trained on 1024 v5p TPUs for about 6 days. |
| Software Dependencies | No | The paper mentions JAX, T5, OWLv2, and implies NLTK with Python code examples, but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Table A. Pre-training hyper-parameters and settings for the in-house CLIP baseline and CLOC. Batch size: 32768; Image size: 224×224 (ViT-B/16) or 336×336 (ViT-L/14, H/14); ... Optimizer: AdamW (β1 = 0.9, β2 = 0.98); Peak learning rate (LR): 0.0005; LR schedule: cosine decay with linear warm-up (first 2k steps); Weight decay: 0.2; Dropout rate: 0.0 |
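The LR schedule reported in the Experiment Setup row (linear warm-up over the first 2k steps, then cosine decay, peak LR 0.0005) can be sketched in plain Python. The total step count below is a placeholder assumption; the excerpt does not state it:

```python
import math

def warmup_cosine_lr(step, peak_lr=5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warm-up to peak_lr, then cosine decay to zero.

    peak_lr and warmup_steps follow the reported settings;
    total_steps=100_000 is a placeholder, not a value from the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule returns 0 at step 0, exactly the peak LR at step 2000, and decays back toward 0 at the final step.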
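The region-retrieval evaluation quoted in the Dataset Splits row (encode image regions and region captions, then match them) amounts to nearest-neighbor retrieval in a shared embedding space. A minimal cosine-similarity sketch, using placeholder embedding arrays rather than the paper's encoders:

```python
import numpy as np

def retrieve_captions(region_embs, caption_embs):
    """For each region embedding, return the index of the caption embedding
    with the highest cosine similarity.

    region_embs, caption_embs: 2-D arrays of shape (n, d) and (m, d);
    placeholder inputs standing in for the model's encoded features.
    """
    # L2-normalize so the dot product equals cosine similarity.
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    # (n, m) similarity matrix; argmax over captions per region.
    return (r @ c.T).argmax(axis=1)
```

Retrieval accuracy is then the fraction of regions whose top-ranked caption is the ground-truth one.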