TULIP: Token-length Upgraded CLIP
Authors: Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TULIP on three downstream tasks: short caption cross-modal retrieval, long caption cross-modal retrieval, and text-to-image generation. For short caption cross-modal retrieval, we follow Zhang et al. (2024) and evaluate our model on the COCO2017 5k validation set (Lin et al., 2014) and the full Flickr30k dataset (Plummer et al., 2015). Similarly, for long caption cross-modal retrieval, we use two datasets, namely the ShareGPT4V test split and Urban-1K. |
| Researcher Affiliation | Academia | Ivona Najdenkoska Mohammad Mahdi Derakhshani Yuki M. Asano Nanne van Noord Marcel Worring Cees G. M. Snoek University of Amsterdam Amsterdam, the Netherlands |
| Pseudocode | No | The paper describes the method and training procedure in detail, but it does not include any explicit pseudocode blocks or algorithms labeled as such. Figures 1 and 2 are diagrams illustrating the architecture and training steps, not pseudocode. |
| Open Source Code | Yes | The code repository is available at https://github.com/ivonajdenkoska/tulip. |
| Open Datasets | Yes | In addition to our new method, we introduce a new benchmark for long captions adapted from the recently introduced Dense Captioning Images (DCI) dataset (Urbanek et al., 2024). We evaluate TULIP on three downstream tasks: short caption cross-modal retrieval, long caption cross-modal retrieval, and text-to-image generation. For short caption cross-modal retrieval, we follow Zhang et al. (2024) and evaluate our model on the COCO2017 5k validation set (Lin et al., 2014) and the full Flickr30k dataset (Plummer et al., 2015). Similarly, for long caption cross-modal retrieval, we use two datasets, namely the ShareGPT4V test split and Urban-1K. |
| Dataset Splits | Yes | For short caption cross-modal retrieval, we follow Zhang et al. (2024) and evaluate our model on the COCO2017 5k validation set (Lin et al., 2014) and the full Flickr30k dataset (Plummer et al., 2015). Similarly, for long caption cross-modal retrieval, we use two datasets, namely the ShareGPT4V test split and Urban-1K. For each dataset, we choose the best design and settings using a validation set, and then report the final results on the test set. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU models, CPU types) used for running the experiments. It mentions using OpenAI's pre-trained CLIP-ViT-B-16 and CLIP-ViT-L-14 architectures, which refers to the models themselves, not the hardware used to train or run them. |
| Software Dependencies | No | The paper mentions basing implementations on "OpenAI's pre-trained CLIP-ViT-B-16 and CLIP-ViT-L-14 architectures (Ilharco et al., 2021)" and using the "AdamW optimizer (Loshchilov, 2017)". However, it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages. |
| Experiment Setup | Yes | During the relative position distillation phase, we truncate captions to the first 77 tokens for both the teacher and student models. We train the student model using cosine loss as the distillation loss function for 20 epochs with a batch size of 640 using the AdamW optimizer (Loshchilov, 2017), setting the learning rate to 5e-4 with 1000 warmup steps. In the relative position expansion phase, we employ full-length captions without truncation, exposing the model to comprehensive textual details. The full TULIP model, featuring the new distilled text encoder, is fine-tuned using the NTK approach, with α empirically set to 8.0. We perform this finetuning stage for a single epoch with a batch size of 1280, a learning rate of 1e-5, and 1000 warmup steps using AdamW. |
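The two quantitative ingredients of the setup above can be sketched in a few lines: the cosine distillation loss used in the relative position distillation phase, and the NTK-aware base scaling commonly used for context extension with α as the scaling factor. This is a minimal NumPy illustration, not the authors' implementation; the function names are hypothetical, and the NTK formula shown is the widely used RoPE base-scaling form, which the paper does not spell out explicitly.

```python
import numpy as np

def cosine_distillation_loss(student_emb, teacher_emb, eps=1e-8):
    """Mean cosine distance (1 - cos similarity) between student and
    teacher text embeddings, averaged over the batch dimension."""
    s = student_emb / (np.linalg.norm(student_emb, axis=-1, keepdims=True) + eps)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def ntk_scaled_base(base=10000.0, alpha=8.0, dim=64):
    """NTK-aware scaling of the rotary position-embedding base frequency.
    With alpha = 8.0 (as in the paper), the effective base grows so that
    positions beyond the original 77-token window stay distinguishable."""
    return base * alpha ** (dim / (dim - 2))
```

A perfectly distilled student (embeddings identical to the teacher's) gives a loss near zero, while orthogonal embeddings give a loss of 1.0, so the loss directly measures angular disagreement between the two encoders.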