Contrastive Localized Language-Image Pre-Training
Authors: Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across 31 evaluation tasks, including standard image-text tasks, newly constructed region-text tasks, and downstream evaluations with MLLMs, we demonstrate that CLOC significantly and consistently outperforms the CLIP counterpart. |
| Researcher Affiliation | Industry | Work done while at Apple. 1Apple AI/ML. Correspondence to: Zhe Gan <EMAIL>. |
| Pseudocode | Yes | Concretely, VESL is a pseudo-labeling pipeline with the following steps, with pseudocode in Appendix C: |
| Open Source Code | No | We are working on releasing our pre-trained checkpoints and the constructed region-text annotations along with the final version to accelerate future research. |
| Open Datasets | Yes | Existing region-text corpus like Visual Genome (Krishna et al., 2017) contains about 108K images, and the largest noisy-labeled grounded dataset GRIT (Peng et al., 2023) features only around 20M images. |
| Dataset Splits | Yes | For region retrieval, we use a validation set of the GRIT dataset (Peng et al., 2023) and encode both the image regions and the region captions. ... We randomly sampled a 2K image validation set for fast evaluation. |
| Hardware Specification | Yes | Our large models (ViT-L/14) were trained on 1024 v5p TPUs for about 6 days. |
| Software Dependencies | No | The paper mentions JAX, T5, OWLv2, and implies NLTK with Python code examples, but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Table A. Pre-training hyper-parameters and settings for the in-house CLIP baseline and CLOC. Batch size: 32768; Image size: 224×224 (ViT-B/16) or 336×336 (ViT-L/14, H/14); ... Optimizer: AdamW (β1 = 0.9, β2 = 0.98); Peak learning rate (LR): 0.0005; LR schedule: cosine decay with linear warm-up (first 2k steps); Weight decay: 0.2; Dropout rate: 0.0 |
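The LR schedule reported in the Experiment Setup row (linear warm-up over the first 2k steps, then cosine decay, peak LR 0.0005) can be sketched in plain Python. The total step count below is a placeholder assumption; the excerpt does not state it:

```python
import math

def warmup_cosine_lr(step, peak_lr=5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warm-up to peak_lr, then cosine decay to zero.

    peak_lr and warmup_steps follow the reported settings;
    total_steps=100_000 is a placeholder, not a value from the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule returns 0 at step 0, exactly the peak LR at step 2000, and decays back toward 0 at the final step.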
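The region-retrieval evaluation quoted in the Dataset Splits row (encode image regions and region captions, then match them) amounts to nearest-neighbor retrieval in a shared embedding space. A minimal cosine-similarity sketch, using placeholder embedding arrays rather than the paper's encoders:

```python
import numpy as np

def retrieve_captions(region_embs, caption_embs):
    """For each region embedding, return the index of the caption embedding
    with the highest cosine similarity.

    region_embs, caption_embs: 2-D arrays of shape (n, d) and (m, d);
    placeholder inputs standing in for the model's encoded features.
    """
    # L2-normalize so the dot product equals cosine similarity.
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    # (n, m) similarity matrix; argmax over captions per region.
    return (r @ c.T).argmax(axis=1)
```

Retrieval accuracy is then the fraction of regions whose top-ranked caption is the ground-truth one.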