TIPS: Text-Image Pretraining with Spatial awareness
Authors: Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks. |
| Researcher Affiliation | Industry | Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, André Araujo — Google DeepMind. Correspondence: EMAIL |
| Pseudocode | No | The paper describes methods with mathematical formulas (e.g., Eq. 1, $\mathcal{L}_{\text{distill}} = -\sum_m \text{softmax}(p_b^t/\tau_t) \cdot \log(\text{softmax}(p_b^m/\tau_s))$) and block diagrams, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' sections, nor any structured code-like procedures. |
| Open Source Code | Yes | Code and models are released at https://github.com/google-deepmind/tips. |
| Open Datasets | Yes | Our models are evaluated on a suite of 8 tasks involving 16 datasets in total... Semantic segmentation is a dense task evaluated on PASCAL VOC (Everingham et al., 2010) and ADE20k (Zhou et al., 2017) datasets... We leverage the WebLI dataset (Chen et al., 2023)... |
| Dataset Splits | Yes | Following DINOv2 (Oquab et al., 2024), we use the training sets of some of our evaluation datasets as the curated queries (details in the appendix). We also remove near-duplicate images from our dataset if they appeared in any of the evaluation datasets used in this paper. Semantic segmentation... We use a simple linear probe setup similar to (Oquab et al., 2024)... |
| Hardware Specification | Yes | We train the ViT-B models for 70 epochs at batch size 16k, which takes 4 days on 256 TPUv3 chips. For the ViT-g model we train for 15 epochs at batch size 16k, which takes 2 days on 512 TPUv5 chips. |
| Software Dependencies | No | The paper mentions using 'Adafactor optimizer (Shazeer & Stern, 2018)' and 'PaliGemma (Beyer et al., 2024) model for image captioning', but does not provide specific version numbers for these or any other key software components or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We use 1 global crop at resolution 224 and M = 6 local crops at resolution 98. We train the ViT-B models for 70 epochs at batch size 16k... For our high-res variant (TIPS-g/14 HR), we run an additional finetuning stage with global crops at resolution 448 and local crops at resolution 140, for 0.1 epochs at batch size 4k. We use only random resize crops and horizontal flips as image augmentations. Loss weight coefficients as in Sec. 3.2 are α = 1, β = 2. We use the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate schedule of linear warm-up for 1.4 epochs up to 5e-4, and then linear decay down to 0 for the remaining epochs. |
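The distillation loss quoted in the Pseudocode row is a DINO-style cross-entropy between a sharpened teacher distribution (from the global crop) and each student crop's distribution. A minimal NumPy sketch of that loss, with illustrative temperatures and tensor names (the paper's exact logit shapes and temperature values are not reproduced here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the prototype dimension.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(teacher_logits, student_logits_per_crop, tau_t=0.04, tau_s=0.1):
    """Sketch of the Eq. 1 distillation objective: average, over the M crops,
    of the cross-entropy between teacher and student distributions.
    All names and temperature defaults are illustrative, not the paper's."""
    p_t = softmax(teacher_logits / tau_t)           # sharpened teacher targets
    total = 0.0
    for s_logits in student_logits_per_crop:        # one entry per crop m
        log_p_s = np.log(softmax(s_logits / tau_s) + 1e-9)
        total += -(p_t * log_p_s).sum(axis=-1).mean()
    return total / len(student_logits_per_crop)
```

With the setup row's M = 6 local crops, `student_logits_per_crop` would hold six arrays, each matching the teacher's batch and prototype dimensions.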
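The Experiment Setup row describes the learning-rate schedule as a linear warm-up for 1.4 of 70 epochs up to 5e-4, followed by linear decay to 0. A small sketch of that piecewise-linear schedule, with step counts expressed abstractly (the actual steps-per-epoch depend on the 16k batch size and dataset size, which are not derived here):

```python
def lr_at_step(step, total_steps, warmup_steps, peak_lr=5e-4):
    """Piecewise-linear schedule: warm up linearly to peak_lr over
    warmup_steps, then decay linearly to 0 at total_steps.
    Mirrors the schedule described in the setup; step counts are
    left to the caller (e.g. warmup_steps ≈ 1.4/70 of total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, with `total_steps=70_000` and `warmup_steps=1_400`, the rate rises to 5e-4 at step 1,400 and reaches 0 at step 70,000.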