VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Authors: Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We develop a series of VLM2VEC models based on state-of-the-art VLMs, including Phi-3.5-V, LLaVA-1.6, and Qwen2-VL, and evaluate them on the MMEB benchmark. With LoRA tuning, VLM2VEC achieves a 10% to 20% improvement over existing multimodal embedding models on MMEB's evaluation sets. Our findings reveal that VLMs are secretly strong embedding models.
Researcher Affiliation Collaboration 1University of Waterloo, 2Salesforce Research
Pseudocode No The paper describes the VLM2VEC framework and contrastive training mathematically in Section 3.1, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper provides a project website link (https://tiger-ai-lab.github.io/VLM2Vec/), but it does not explicitly state that source code is released there, nor does it provide a direct link to a code repository.
Open Datasets Yes We present MMEB (Massive Multimodal Embedding Benchmark), a comprehensive benchmark designed to evaluate multimodal embeddings across a diverse set of tasks. MMEB consists of 36 datasets organized into four meta-tasks: classification, visual question answering, retrieval, and visual grounding. Each task is reformulated as a ranking problem... Examples for each dataset in MMEB are provided in Tables 7, 8, 9 and 10. The diversity in MMEB makes it an ideal testbed for universal embeddings. Further details on dataset processing can be found in Section A.1.
Dataset Splits Yes MMEB is divided into 20 in-distribution datasets, which can be used for training, and 16 out-of-distribution datasets, reserved for evaluation. ... For the number of target candidates, a higher count could increase evaluation costs and hinder rapid model iteration, while a lower count might make the benchmark too simple and prone to saturation. To strike a balance between these extremes, we have chosen 1,000 candidates. ... For the 20 training datasets, we randomly select up to 100K data points.
Hardware Specification Yes All experiments were run on 8 H100 GPUs.
Software Dependencies No The paper mentions specific models and techniques used, such as Phi-3.5-V, LLaVA-1.6, Qwen2-VL, LoRA, and GradCache, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes The temperature for the loss function is set to 0.02, with a batch size of 1,024, a maximum text length of 256 tokens, and 2K training steps. The LoRA variant uses a rank of 8. For VLM2VEC leveraging Phi-3.5-V as the backbone, we configure the number of sub-image crops to 4. For VLM2VEC using LLaVA-1.6 and Qwen2-VL as the backbone, we resize the input images to a uniform resolution, employing two setups: a high-resolution configuration of 1344×1344 and a low-resolution configuration of 336×336.
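The temperature-scaled contrastive objective quoted above (temperature 0.02, batch size 1,024 with in-batch negatives) matches the standard InfoNCE form; the paper's exact formulation is in its Section 3.1. Below is a minimal PyTorch sketch under that assumption; the function name `info_nce_loss` and the toy batch are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, target_emb, temperature=0.02):
    """InfoNCE contrastive loss with in-batch negatives.

    query_emb, target_emb: (batch, dim) L2-normalized embeddings;
    each query's positive target sits at the same row index, and
    every other row in the batch serves as a negative.
    """
    # (batch, batch) cosine-similarity matrix, sharpened by the temperature.
    logits = query_emb @ target_emb.T / temperature
    # Diagonal entries are the positive pairs.
    labels = torch.arange(query_emb.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 4 normalized 8-dim embeddings, with targets
# that are near-duplicates of the queries standing in for positives.
q = F.normalize(torch.randn(4, 8), dim=-1)
t = F.normalize(q + 0.01 * torch.randn(4, 8), dim=-1)
loss = info_nce_loss(q, t)
```

A low temperature such as 0.02 sharpens the softmax over candidates, so small similarity gaps between the positive and the in-batch negatives translate into large gradient signal, which is a common choice for embedding models trained with large batches.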