TIPS: Text-Image Pretraining with Spatial awareness
Authors: Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks. |
| Researcher Affiliation | Industry | Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, André Araujo — Google DeepMind. Correspondence: EMAIL |
| Pseudocode | No | The paper describes methods with mathematical formulas (e.g., Eq. 1, $\mathcal{L}_{\text{distill}} = -\sum_m \text{softmax}(p_b^t/\tau_t) \cdot \log(\text{softmax}(p_b^m/\tau_s))$) and block diagrams, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' sections, nor any structured code-like procedures. |
| Open Source Code | Yes | Code and models are released at https://github.com/google-deepmind/tips. |
| Open Datasets | Yes | Our models are evaluated on a suite of 8 tasks involving 16 datasets in total... Semantic segmentation is a dense task evaluated on PASCAL VOC (Everingham et al., 2010) and ADE20k (Zhou et al., 2017) datasets... We leverage the WebLI dataset (Chen et al., 2023)... |
| Dataset Splits | Yes | Following DINOv2 (Oquab et al., 2024), we use the training sets of some of our evaluation datasets as the curated queries (details in the appendix). We also remove near-duplicate images from our dataset if they appeared in any of the evaluation datasets used in this paper. Semantic segmentation... We use a simple linear probe setup similar to (Oquab et al., 2024)... |
| Hardware Specification | Yes | We train the ViT-B models for 70 epochs at batch size 16k, which takes 4 days on 256 TPUv3 chips. For the ViT-g model we train for 15 epochs at batch size 16k, which takes 2 days on 512 TPUv5 chips. |
| Software Dependencies | No | The paper mentions using 'Adafactor optimizer (Shazeer & Stern, 2018)' and 'PaliGemma (Beyer et al., 2024) model for image captioning', but does not provide specific version numbers for these or any other key software components or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We use 1 global crop at resolution 224 and M = 6 local crops at resolution 98. We train the ViT-B models for 70 epochs at batch size 16k... For our high-res variant (TIPS-g/14 HR), we run an additional finetuning stage with global crops at resolution 448 and local crops at resolution 140, for 0.1 epochs at batch size 4k. We use only random resize crops and horizontal flips as image augmentations. Loss weight coefficients as in Sec. 3.2 are α = 1, β = 2. We use the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate schedule of linear warm-up for 1.4 epochs up to 5e-4, and then linear decay down to 0 for the remaining epochs. |
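The distillation loss quoted in the Pseudocode row is a DINO-style cross-entropy between a sharpened teacher distribution (from the global crop) and each student crop's distribution. A minimal NumPy sketch of that loss, with illustrative temperatures and tensor names (the paper's exact logit shapes and temperature values are not reproduced here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the prototype dimension.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(teacher_logits, student_logits_per_crop, tau_t=0.04, tau_s=0.1):
    """Sketch of the Eq. 1 distillation objective: average, over the M crops,
    of the cross-entropy between teacher and student distributions.
    All names and temperature defaults are illustrative, not the paper's."""
    p_t = softmax(teacher_logits / tau_t)           # sharpened teacher targets
    total = 0.0
    for s_logits in student_logits_per_crop:        # one entry per crop m
        log_p_s = np.log(softmax(s_logits / tau_s) + 1e-9)
        total += -(p_t * log_p_s).sum(axis=-1).mean()
    return total / len(student_logits_per_crop)
```

With the setup row's M = 6 local crops, `student_logits_per_crop` would hold six arrays, each matching the teacher's batch and prototype dimensions.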
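The Experiment Setup row describes the learning-rate schedule as a linear warm-up for 1.4 of 70 epochs up to 5e-4, followed by linear decay to 0. A small sketch of that piecewise-linear schedule, with step counts expressed abstractly (the actual steps-per-epoch depend on the 16k batch size and dataset size, which are not derived here):

```python
def lr_at_step(step, total_steps, warmup_steps, peak_lr=5e-4):
    """Piecewise-linear schedule: warm up linearly to peak_lr over
    warmup_steps, then decay linearly to 0 at total_steps.
    Mirrors the schedule described in the setup; step counts are
    left to the caller (e.g. warmup_steps ≈ 1.4/70 of total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, with `total_steps=70_000` and `warmup_steps=1_400`, the rate rises to 5e-4 at step 1,400 and reaches 0 at step 70,000.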