Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

Authors: Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient for learning task-optimal representations... We examine two methods to reduce shortcut learning in our training and evaluation framework... We show empirically that both methods improve performance on the evaluation task..."
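The excerpt above describes injecting synthetic shortcuts into image-text pairs so that a contrastive model can solve the matching task from the shortcut alone. The sketch below is a hypothetical illustration of that idea, not the paper's actual implementation: each pair receives a shared unique identifier token, attached to the caption as text and to the image record as metadata (a real pipeline would render the identifier into the image pixels).

```python
def inject_shortcuts(images, captions):
    """Attach a shared per-pair identifier to each image-caption pair.

    Hypothetical sketch: both modalities receive a matching token such as
    "<id_3>", which uniquely identifies the pair. A contrastive objective
    can then align pairs using the token alone, bypassing the semantics.
    """
    tagged_images, tagged_captions = [], []
    for pair_id, (img, cap) in enumerate(zip(images, captions)):
        token = f"<id_{pair_id}>"
        tagged_images.append({"image": img, "shortcut": token})
        tagged_captions.append(f"{cap} {token}")
    return tagged_images, tagged_captions
```

Because the token is trivially predictive of the match, a model that latches onto it needs none of the visual or linguistic features the evaluation task actually requires, which is the failure mode the framework is built to expose.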
Researcher Affiliation | Academia | Maurits Bleeker, University of Amsterdam; Mariya Hendriksen, AIRLab, University of Amsterdam; Andrew Yates, University of Amsterdam; Maarten de Rijke, University of Amsterdam (all in Amsterdam, The Netherlands)
Pseudocode | No | The paper gives mathematical definitions of the loss functions (InfoNCE, LTD, IFM) in Appendix E, but it does not contain structured pseudocode or algorithm blocks with numbered, code-like steps.
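Of the losses named above, InfoNCE is the standard contrastive objective that the paper argues is insufficient on its own. As a reference point, here is a minimal pure-Python sketch of the one-directional (image-to-text) InfoNCE loss over a precomputed similarity matrix; the function name, the default temperature of 0.07, and the single-direction form are illustrative assumptions, not the paper's exact formulation.

```python
import math

def info_nce(sim, temperature=0.07):
    """One-directional InfoNCE over a similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matching
    pairs lie on the diagonal. Returns the mean negative log-softmax of
    each positive pair against all captions in the batch.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n
```

In practice the loss is usually symmetrized by averaging the image-to-text and text-to-image directions; LTD and IFM, as described in the paper, are added on top of or modify this contrastive objective to discourage shortcut solutions.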
Open Source Code | Yes | "To facilitate the reproducibility and support further research, we provide the code with our paper." Repository: https://github.com/MauritsBleeker/svl-framework
Open Datasets | Yes | We evaluate the models' performance on the Flickr30k (Young et al., 2014) and MS-COCO (Lin et al., 2014; Chen et al., 2015) benchmarks. Flickr30k consists of 31,000 images, each annotated with 5 matching captions (Young et al., 2014). MS-COCO consists of 123,287 images, each annotated with 5 matching captions (Lin et al., 2014).
Dataset Splits | Yes | For both datasets, we use the training, validation, and test splits of Karpathy & Li (2015).
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., GPU model, CPU type, or memory); it only describes the models trained (CLIP, VSE++).
Software Dependencies | No | The paper mentions the AdamW (Loshchilov & Hutter, 2019) and Adam (Kingma & Ba, 2015) optimizers and models such as Sentence-BERT (Reimers & Gurevych, 2019; Song et al., 2020), but it does not specify versions for the programming language (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | CLIP: all models are fine-tuned for 5 epochs with a cosine-annealing learning rate schedule, a base learning rate of 2e-5, and 100 warm-up steps, using the AdamW optimizer (Loshchilov & Hutter, 2019) with a gradient clipping value of 2. VSE++: the model is trained for 30 epochs with a linear learning rate schedule and a base learning rate of 2e-4, using the Adam optimizer (Kingma & Ba, 2015) with a gradient clipping value of 2.
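The quoted CLIP recipe combines linear warm-up with cosine annealing. The schedule can be written as a small pure function; the function name and the linear-warm-up shape are assumptions (the paper states only "cosine-annealing", "base learning rate of 2e-5", and "100 steps of warm-up"), so treat this as one plausible reading of the recipe rather than the authors' code.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Cosine-annealing learning rate with linear warm-up.

    Ramps linearly from ~0 to base_lr over the first warmup_steps,
    then decays from base_lr toward 0 along a half-cosine curve over
    the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In a PyTorch-style training loop this would typically be paired with gradient clipping at the stated value of 2 before each optimizer step.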