Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Authors: Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient for learning task-optimal representations... We examine two methods to reduce shortcut learning in our training and evaluation framework... We show empirically that both methods improve performance on the evaluation task... |
| Researcher Affiliation | Academia | Maurits Bleeker EMAIL University of Amsterdam, Amsterdam, The Netherlands Mariya Hendriksen EMAIL AIRLab, University of Amsterdam, Amsterdam, The Netherlands Andrew Yates EMAIL University of Amsterdam, Amsterdam, The Netherlands Maarten de Rijke EMAIL University of Amsterdam, Amsterdam, The Netherlands |
| Pseudocode | No | The paper gives mathematical definitions for its loss functions (InfoNCE, LTD, IFM) in Appendix E, but it does not contain any structured pseudocode or algorithm blocks with numbered steps formatted like code. |
| Open Source Code | Yes | To facilitate the reproducibility and support further research, we provide the code with our paper.1 1https://github.com/MauritsBleeker/svl-framework |
| Open Datasets | Yes | We evaluate the models' performance on the Flickr30k (Young et al., 2014) and MS-COCO (Lin et al., 2014; Chen et al., 2015) benchmarks. Flickr30k consists of 31,000 images annotated with 5 matching captions (Young et al., 2014). MS-COCO consists of 123,287 images, each annotated with 5 matching captions (Lin et al., 2014). |
| Dataset Splits | Yes | For both datasets, we use the training, validation, and test splits from Karpathy & Li (2015). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It only describes the models that were trained (CLIP, VSE++). |
| Software Dependencies | No | The paper mentions specific optimizers like AdamW (Loshchilov & Hutter, 2019) and Adam (Kingma & Ba, 2015), and models like Sentence-BERT (Reimers & Gurevych, 2019; Song et al., 2020), but it does not specify version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | CLIP. All models are fine-tuned for 5 epochs. We employ a cosine-annealing learning rate schedule, with a base learning rate of 2e-5, and 100 steps of warm-up. As an optimizer, we use AdamW (Loshchilov & Hutter, 2019) with a gradient clipping value of 2. VSE++. The model is trained for 30 epochs using a linear learning rate schedule with a base learning rate of 2e-4. We use the Adam optimizer (Kingma & Ba, 2015) with a gradient clipping value of 2. |
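The CLIP fine-tuning schedule quoted above (base learning rate 2e-5, 100 warm-up steps, cosine annealing) can be sketched as a standalone function. This is a minimal illustration of one common reading of that description, assuming linear warm-up followed by cosine decay to zero over the remaining steps; the paper does not spell out these details, so treat the exact shape as an assumption.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Learning rate at a given optimizer step (0-indexed).

    Assumes linear warm-up to base_lr over warmup_steps, then
    cosine annealing from base_lr down to 0 over the remaining steps.
    """
    if step < warmup_steps:
        # Linear warm-up: ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing over the post-warm-up portion of training.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000`, the rate reaches the full 2e-5 at the end of warm-up (step 99) and decays to 0 by the final step.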