ABC: Achieving Better Control of Visual Embeddings using VLLMs
Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design Ctrl Bench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. |
| Researcher Affiliation | Academia | Benjamin Schneider, University of Waterloo; Florian Kerschbaum, University of Waterloo; Wenhu Chen, University of Waterloo |
| Pseudocode | No | The paper describes methods and training regimes but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ... Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/ |
| Open Datasets | Yes | Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/ ... We use MSCOCO (Lin et al., 2015) image-to-text retrieval ... We use Visual Genome (Krishna et al., 2016) ... To create our pretraining dataset we employ negative mining on Conceptual Captions (Sharma et al., 2018). ... To construct Ctrl Bench, we sample 1000 images from ADE20K (Zhou et al., 2017). |
| Dataset Splits | Yes | MSCOCO (5K test set), Flickr30K (1K test set) ... Each batch contains 128 unique images, with each image appearing four times, paired with a different instruction and a corresponding positive text candidate. ... To construct Ctrl Bench, we sample 1000 images from ADE20K (Zhou et al., 2017). To create 5000 instruction and text candidate pairs, we generate 5 instructions for each image, each corresponding to a distinct aspect of the image. |
| Hardware Specification | Yes | We pretrain using batches of 512 image queries and 4096 text candidates sharded across 8 NVIDIA A100-SXM4-80GB GPUs (Qu et al., 2021) for 4000 steps. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer and LoRA as an adaptation technique, but it does not specify version numbers for any software libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We pretrain using batches of 512 image queries and 4096 text candidates ... for 4000 steps. We use a LoRA adapter with a rank of 64 and a fixed alpha of 128. ... For our optimizer, we use AdamW (Loshchilov & Hutter, 2019) with a learning rate of 4 × 10⁻⁵, betas of 0.9 and 0.999, and a weight decay of 10⁻³. We warm up for 3% of training steps and initialize the temperature τ as 7 × 10⁻². In our instruction fine-tuning stage, we use a lower-rank LoRA adapter. We set the rank and alpha to 16 and 32, respectively. ... we only instruction fine-tune for 100 steps. Each batch contains 128 unique images... |
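The Dataset Splits row describes the instruction fine-tuning batch layout: 128 unique images, each repeated four times with a different instruction and a matching positive text candidate. A minimal sketch of that layout (all identifiers here are illustrative placeholders, not from the paper's code):

```python
import itertools

def build_batch(images, instructions_per_image, positives_per_image, group_size=4):
    """Build a fine-tuning batch of (image, instruction, positive_text) triples.

    Each image contributes group_size triples, one per instruction, mirroring
    the paper's described batch of 128 unique images appearing 4 times each.
    """
    batch = []
    for img, instrs, poss in zip(images, instructions_per_image, positives_per_image):
        # Take exactly group_size (instruction, positive) pairs for this image.
        for instr, pos in itertools.islice(zip(instrs, poss), group_size):
            batch.append((img, instr, pos))
    return batch

# Placeholder data shaped like the described batch: 128 images x 4 instructions.
images = [f"img_{i}" for i in range(128)]
instrs = [[f"instr_{i}_{k}" for k in range(4)] for i in range(128)]
poss = [[f"text_{i}_{k}" for k in range(4)] for i in range(128)]

batch = build_batch(images, instrs, poss)
# 128 images x 4 instructions each = 512 triples per batch
```

This grouping means each image serves as an in-batch negative for the other images' instruction-conditioned queries, which is presumably why the batch is organized by unique image rather than by independent pairs.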
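The Experiment Setup row reports initializing a temperature τ of 7 × 10⁻², which is characteristic of a temperature-scaled contrastive (InfoNCE-style) objective. A stdlib-only sketch of such a loss under that assumption (the similarity values are toy data, not the paper's):

```python
import math

TAU = 7e-2  # initial temperature reported in the paper

def info_nce_loss(sim_rows, tau=TAU):
    """Mean InfoNCE loss over queries.

    sim_rows[i][j] is the similarity between query i and candidate j;
    candidate i is assumed to be the positive for query i (diagonal).
    """
    total = 0.0
    for i, row in enumerate(sim_rows):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # -log softmax at the positive
    return total / len(sim_rows)

# Toy 3x3 similarity matrix with positives on the diagonal.
sims = [
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.1],
    [0.2, 0.3, 0.7],
]
loss = info_nce_loss(sims)
```

Dividing similarities by a small τ sharpens the softmax, so well-separated positives drive the loss toward zero; this is why the loss above is small for the toy matrix, where each diagonal entry dominates its row.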