Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

Authors: Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient for learning task-optimal representations... We examine two methods to reduce shortcut learning in our training and evaluation framework... We show empirically that both methods improve performance on the evaluation task..."
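The excerpt above describes injecting synthetic shortcuts into image-text pairs so that a contrastive model can solve the matching task from the shortcut alone. The sketch below is a hypothetical illustration of that idea, not the paper's actual implementation: each pair receives a shared unique identifier token, attached to the caption as text and to the image record as metadata (a real pipeline would render the identifier into the image pixels).

```python
def inject_shortcuts(images, captions):
    """Attach a shared per-pair identifier to each image-caption pair.

    Hypothetical sketch: both modalities receive a matching token such as
    "<id_3>", which uniquely identifies the pair. A contrastive objective
    can then align pairs using the token alone, bypassing the semantics.
    """
    tagged_images, tagged_captions = [], []
    for pair_id, (img, cap) in enumerate(zip(images, captions)):
        token = f"<id_{pair_id}>"
        tagged_images.append({"image": img, "shortcut": token})
        tagged_captions.append(f"{cap} {token}")
    return tagged_images, tagged_captions
```

Because the token is trivially predictive of the match, a model that latches onto it needs none of the visual or linguistic features the evaluation task actually requires, which is the failure mode the framework is built to expose.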
Researcher Affiliation | Academia | Maurits Bleeker, University of Amsterdam; Mariya Hendriksen, AIRLab, University of Amsterdam; Andrew Yates, University of Amsterdam; Maarten de Rijke, University of Amsterdam (all in Amsterdam, The Netherlands)
Pseudocode | No | The paper gives mathematical definitions of the loss functions (InfoNCE, LTD, IFM) in Appendix E, but it does not contain structured pseudocode or algorithm blocks with numbered, code-like steps.
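Of the losses named above, InfoNCE is the standard contrastive objective that the paper argues is insufficient on its own. As a reference point, here is a minimal pure-Python sketch of the one-directional (image-to-text) InfoNCE loss over a precomputed similarity matrix; the function name, the default temperature of 0.07, and the single-direction form are illustrative assumptions, not the paper's exact formulation.

```python
import math

def info_nce(sim, temperature=0.07):
    """One-directional InfoNCE over a similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matching
    pairs lie on the diagonal. Returns the mean negative log-softmax of
    each positive pair against all captions in the batch.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n
```

In practice the loss is usually symmetrized by averaging the image-to-text and text-to-image directions; LTD and IFM, as described in the paper, are added on top of or modify this contrastive objective to discourage shortcut solutions.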
Open Source Code | Yes | "To facilitate the reproducibility and support further research, we provide the code with our paper." Repository: https://github.com/MauritsBleeker/svl-framework
Open Datasets | Yes | We evaluate the models' performance on the Flickr30k (Young et al., 2014) and MS-COCO (Lin et al., 2014; Chen et al., 2015) benchmarks. Flickr30k consists of 31,000 images, each annotated with 5 matching captions (Young et al., 2014). MS-COCO consists of 123,287 images, each annotated with 5 matching captions (Lin et al., 2014).
Dataset Splits | Yes | For both datasets, we use the training, validation, and test splits of Karpathy & Li (2015).
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., GPU model, CPU type, or memory); it only describes the models trained (CLIP, VSE++).
Software Dependencies | No | The paper mentions the AdamW (Loshchilov & Hutter, 2019) and Adam (Kingma & Ba, 2015) optimizers and models such as Sentence-BERT (Reimers & Gurevych, 2019; Song et al., 2020), but it does not specify versions for the programming language (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | CLIP: all models are fine-tuned for 5 epochs with a cosine-annealing learning rate schedule, a base learning rate of 2e-5, and 100 warm-up steps, using the AdamW optimizer (Loshchilov & Hutter, 2019) with a gradient clipping value of 2. VSE++: the model is trained for 30 epochs with a linear learning rate schedule and a base learning rate of 2e-4, using the Adam optimizer (Kingma & Ba, 2015) with a gradient clipping value of 2.
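The quoted CLIP recipe combines linear warm-up with cosine annealing. The schedule can be written as a small pure function; the function name and the linear-warm-up shape are assumptions (the paper states only "cosine-annealing", "base learning rate of 2e-5", and "100 steps of warm-up"), so treat this as one plausible reading of the recipe rather than the authors' code.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Cosine-annealing learning rate with linear warm-up.

    Ramps linearly from ~0 to base_lr over the first warmup_steps,
    then decays from base_lr toward 0 along a half-cosine curve over
    the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In a PyTorch-style training loop this would typically be paired with gradient clipping at the stated value of 2 before each optimizer step.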