ABC: Achieving Better Control of Visual Embeddings using VLLMs
Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design Ctrl Bench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. |
| Researcher Affiliation | Academia | Benjamin Schneider, University of Waterloo; Florian Kerschbaum, University of Waterloo; Wenhu Chen, University of Waterloo |
| Pseudocode | No | The paper describes methods and training regimes but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ... Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/ |
| Open Datasets | Yes | Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/ ... We use MSCOCO (Lin et al., 2015) image-to-text retrieval ... We use Visual Genome (Krishna et al., 2016) ... To create our pretraining dataset we employ negative mining on Conceptual Captions (Sharma et al., 2018). ... To construct Ctrl Bench, we sample 1000 images from ADE20K (Zhou et al., 2017). |
| Dataset Splits | Yes | MSCOCO (5K test set), Flickr30K (1K test set) ... Each batch contains 128 unique images, with each image appearing four times, paired with a different instruction and a corresponding positive text candidate. ... To construct Ctrl Bench, we sample 1000 images from ADE20K (Zhou et al., 2017). To create 5000 instruction and text candidate pairs, we generate 5 instructions for each image, each corresponding to a distinct aspect of the image. |
| Hardware Specification | Yes | We pretrain using batches of 512 image queries and 4096 text candidates sharded across 8 NVIDIA A100-SXM4-80GB GPUs (Qu et al., 2021) for 4000 steps. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer and LoRA as an adaptation technique, but it does not specify version numbers for any software libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We pretrain using batches of 512 image queries and 4096 text candidates ... for 4000 steps. We use a LoRA adapter with a rank of 64 and a fixed alpha of 128. ... For our optimizer, we use AdamW (Loshchilov & Hutter, 2019) with a learning rate of 4 × 10⁻⁵, betas of 0.9 and 0.999, and a weight decay of 10⁻³. We warm up for 3% of training steps and initialize the temperature τ as 7 × 10⁻². In our instruction fine-tuning stage, we use a lower-rank LoRA adapter. We set the rank and alpha to 16 and 32, respectively. ... we only instruction fine-tune for 100 steps. Each batch contains 128 unique images... |
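The Dataset Splits row describes the instruction fine-tuning batch layout: 128 unique images, each repeated four times with a different instruction and a matching positive text candidate. A minimal sketch of that layout (all identifiers here are illustrative placeholders, not from the paper's code):

```python
import itertools

def build_batch(images, instructions_per_image, positives_per_image, group_size=4):
    """Build a fine-tuning batch of (image, instruction, positive_text) triples.

    Each image contributes group_size triples, one per instruction, mirroring
    the paper's described batch of 128 unique images appearing 4 times each.
    """
    batch = []
    for img, instrs, poss in zip(images, instructions_per_image, positives_per_image):
        # Take exactly group_size (instruction, positive) pairs for this image.
        for instr, pos in itertools.islice(zip(instrs, poss), group_size):
            batch.append((img, instr, pos))
    return batch

# Placeholder data shaped like the described batch: 128 images x 4 instructions.
images = [f"img_{i}" for i in range(128)]
instrs = [[f"instr_{i}_{k}" for k in range(4)] for i in range(128)]
poss = [[f"text_{i}_{k}" for k in range(4)] for i in range(128)]

batch = build_batch(images, instrs, poss)
# 128 images x 4 instructions each = 512 triples per batch
```

This grouping means each image serves as an in-batch negative for the other images' instruction-conditioned queries, which is presumably why the batch is organized by unique image rather than by independent pairs.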
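The Experiment Setup row reports initializing a temperature τ of 7 × 10⁻², which is characteristic of a temperature-scaled contrastive (InfoNCE-style) objective. A stdlib-only sketch of such a loss under that assumption (the similarity values are toy data, not the paper's):

```python
import math

TAU = 7e-2  # initial temperature reported in the paper

def info_nce_loss(sim_rows, tau=TAU):
    """Mean InfoNCE loss over queries.

    sim_rows[i][j] is the similarity between query i and candidate j;
    candidate i is assumed to be the positive for query i (diagonal).
    """
    total = 0.0
    for i, row in enumerate(sim_rows):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # -log softmax at the positive
    return total / len(sim_rows)

# Toy 3x3 similarity matrix with positives on the diagonal.
sims = [
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.1],
    [0.2, 0.3, 0.7],
]
loss = info_nce_loss(sims)
```

Dividing similarities by a small τ sharpens the softmax, so well-separated positives drive the loss toward zero; this is why the loss above is small for the toy matrix, where each diagonal entry dominates its row.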