TULIP: Token-length Upgraded CLIP
Authors: Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TULIP on three downstream tasks: short caption cross-modal retrieval, long caption cross-modal retrieval, and text-to-image generation. For short caption cross-modal retrieval, we follow Zhang et al. (2024) and evaluate our model on the COCO2017 5k validation set (Lin et al., 2014) and the full Flickr30k dataset (Plummer et al., 2015). Similarly, for long caption cross-modal retrieval, we use two datasets, namely the ShareGPT4V test split and Urban-1K. |
| Researcher Affiliation | Academia | Ivona Najdenkoska Mohammad Mahdi Derakhshani Yuki M. Asano Nanne van Noord Marcel Worring Cees G. M. Snoek University of Amsterdam Amsterdam, the Netherlands |
| Pseudocode | No | The paper describes the method and training procedure in detail, but it does not include any explicit pseudocode blocks or algorithms labeled as such. Figures 1 and 2 are diagrams illustrating the architecture and training steps, not pseudocode. |
| Open Source Code | Yes | The code repository is available at https://github.com/ivonajdenkoska/tulip. |
| Open Datasets | Yes | In addition to our new method, we introduce a new benchmark for long captions adapted from the recently introduced Dense Captioning Images (DCI) dataset (Urbanek et al., 2024). We evaluate TULIP on three downstream tasks: short caption cross-modal retrieval, long caption cross-modal retrieval, and text-to-image generation. For short caption cross-modal retrieval, we follow Zhang et al. (2024) and evaluate our model on the COCO2017 5k validation set (Lin et al., 2014) and the full Flickr30k dataset (Plummer et al., 2015). Similarly, for long caption cross-modal retrieval, we use two datasets, namely the ShareGPT4V test split and Urban-1K. |
| Dataset Splits | Yes | For short caption cross-modal retrieval, we follow Zhang et al. (2024) and evaluate our model on the COCO2017 5k validation set (Lin et al., 2014) and the full Flickr30k dataset (Plummer et al., 2015). Similarly, for long caption cross-modal retrieval, we use two datasets, namely the ShareGPT4V test split and Urban-1K. For each dataset, we choose the best design and settings using a validation set, and then report the final results on the test set. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU models, CPU types) used for running the experiments. It mentions using OpenAI's pre-trained CLIP-ViT-B-16 and CLIP-ViT-L-14 architectures, which refers to the models themselves, not the hardware used to train or run them. |
| Software Dependencies | No | The paper mentions basing implementations on "OpenAI's pre-trained CLIP-ViT-B-16 and CLIP-ViT-L-14 architectures (Ilharco et al., 2021)" and using the "AdamW optimizer (Loshchilov, 2017)". However, it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages. |
| Experiment Setup | Yes | During the relative position distillation phase, we truncate captions to the first 77 tokens for both the teacher and student models. We train the student model using cosine loss as the distillation loss function for 20 epochs with a batch size of 640 using the AdamW optimizer (Loshchilov, 2017), setting the learning rate to 5e-4 with 1000 warmup steps. In the relative position expansion phase, we employ full-length captions without truncation, exposing the model to comprehensive textual details. The full TULIP model, featuring the new distilled text encoder, is fine-tuned using the NTK approach, with α empirically set to 8.0. We perform this finetuning stage for a single epoch with a batch size of 1280, a learning rate of 1e-5, and 1000 warmup steps using AdamW. |
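The two quantitative ingredients of the setup above can be sketched in a few lines: the cosine distillation loss used in the relative position distillation phase, and the NTK-aware base scaling commonly used for context extension with α as the scaling factor. This is a minimal NumPy illustration, not the authors' implementation; the function names are hypothetical, and the NTK formula shown is the widely used RoPE base-scaling form, which the paper does not spell out explicitly.

```python
import numpy as np

def cosine_distillation_loss(student_emb, teacher_emb, eps=1e-8):
    """Mean cosine distance (1 - cos similarity) between student and
    teacher text embeddings, averaged over the batch dimension."""
    s = student_emb / (np.linalg.norm(student_emb, axis=-1, keepdims=True) + eps)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def ntk_scaled_base(base=10000.0, alpha=8.0, dim=64):
    """NTK-aware scaling of the rotary position-embedding base frequency.
    With alpha = 8.0 (as in the paper), the effective base grows so that
    positions beyond the original 77-token window stay distinguishable."""
    return base * alpha ** (dim / (dim - 2))
```

A perfectly distilled student (embeddings identical to the teacher's) gives a loss near zero, while orthogonal embeddings give a loss of 1.0, so the loss directly measures angular disagreement between the two encoders.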