VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Authors: Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a series of VLM2VEC models based on state-of-the-art VLMs, including Phi-3.5-V, LLaVA-1.6, and Qwen2-VL, and evaluate them on MMEB's benchmark. With LoRA tuning, VLM2VEC achieves a 10% to 20% improvement over existing multimodal embedding models on MMEB's evaluation sets. Our findings reveal that VLMs are secretly strong embedding models. |
| Researcher Affiliation | Collaboration | 1University of Waterloo, 2Salesforce Research |
| Pseudocode | No | The paper describes the VLM2VEC framework and contrastive training mathematically in Section 3.1, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://tiger-ai-lab.github.io/VLM2Vec/), but it does not explicitly state that source code is released there, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We present MMEB (Massive Multimodal Embedding Benchmark), a comprehensive benchmark designed to evaluate multimodal embeddings across a diverse set of tasks. MMEB consists of 36 datasets organized into four meta-tasks: classification, visual question answering, retrieval, and visual grounding. Each task is reformulated as a ranking problem... Examples for each dataset in MMEB are provided in Tables 7, 8, 9 and 10. The diversity in MMEB makes it an ideal testbed for universal embeddings. Further details on dataset processing can be found in Section A.1. |
| Dataset Splits | Yes | MMEB is divided into 20 in-distribution datasets, which can be used for training, and 16 out-of-distribution datasets, reserved for evaluation. ... For the number of target candidates, a higher count could increase evaluation costs and hinder rapid model iteration, while a lower count might make the benchmark too simple and prone to saturation. To strike a balance between these extremes, we have chosen 1,000 candidates. ... For the 20 training datasets, we randomly select up to 100K data points. |
| Hardware Specification | Yes | All experiments were run on 8 H100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and techniques used, such as Phi-3.5-V, LLaVA-1.6, Qwen2-VL, LoRA, and GradCache, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The temperature for the loss function is set to 0.02, with a batch size of 1,024, a maximum text length of 256 tokens, and 2K training steps. The LoRA variant uses a rank of 8. For VLM2VEC leveraging Phi-3.5-V as the backbone, we configure the number of sub-image crops to 4. For VLM2VEC using LLaVA-1.6 and Qwen2-VL as the backbone, we resize the input images to a uniform resolution, employing two setups: a high-resolution configuration of 1344×1344 and a low-resolution configuration of 336×336. |
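The setup row above specifies a temperature of 0.02 for the contrastive loss. As a rough illustration of how such a temperature-scaled loss behaves, the following is a minimal single-query InfoNCE sketch in plain Python; the function name, cosine-similarity scoring, and candidate-list layout are assumptions for illustration, not the paper's implementation.

```python
import math

def info_nce_loss(query, targets, positive_idx, temperature=0.02):
    """Hypothetical sketch: temperature-scaled contrastive (InfoNCE) loss
    for one query against a list of candidate embeddings, where
    targets[positive_idx] is the matching (positive) candidate."""

    def cosine(a, b):
        # Cosine similarity between two equal-length vectors.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Low temperature (e.g. 0.02) sharpens the softmax over candidates.
    logits = [cosine(query, t) / temperature for t in targets]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    return -math.log(exps[positive_idx] / sum(exps))
```

With temperature 0.02, even a modest similarity gap between the positive and the negatives translates into a large logit gap, which is why small temperatures are commonly paired with large batch sizes (here 1,024) of in-batch negatives.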