VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Authors: Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We develop a series of VLM2VEC models based on state-of-the-art VLMs, including Phi-3.5-V, LLaVA-1.6, and Qwen2-VL, and evaluate them on the MMEB benchmark. With LoRA tuning, VLM2VEC achieves a 10% to 20% improvement over existing multimodal embedding models on MMEB's evaluation sets. Our findings reveal that VLMs are secretly strong embedding models.
Researcher Affiliation Collaboration 1University of Waterloo, 2Salesforce Research
Pseudocode No The paper describes the VLM2VEC framework and contrastive training mathematically in Section 3.1, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper provides a project website link (https://tiger-ai-lab.github.io/VLM2Vec/), but it does not explicitly state that source code is released there, nor does it provide a direct link to a code repository.
Open Datasets Yes We present MMEB (Massive Multimodal Embedding Benchmark), a comprehensive benchmark designed to evaluate multimodal embeddings across a diverse set of tasks. MMEB consists of 36 datasets organized into four meta-tasks: classification, visual question answering, retrieval, and visual grounding. Each task is reformulated as a ranking problem... Examples for each dataset in MMEB are provided in Tables 7, 8, 9 and 10. The diversity in MMEB makes it an ideal testbed for universal embeddings. Further details on dataset processing can be found in Section A.1.
Dataset Splits Yes MMEB is divided into 20 in-distribution datasets, which can be used for training, and 16 out-of-distribution datasets, reserved for evaluation. ... For the number of target candidates, a higher count could increase evaluation costs and hinder rapid model iteration, while a lower count might make the benchmark too simple and prone to saturation. To strike a balance between these extremes, we have chosen 1,000 candidates. ... For the 20 training datasets, we randomly select up to 100K data points.
Hardware Specification Yes All experiments were run on 8 H100 GPUs.
Software Dependencies No The paper mentions specific models and techniques used, such as Phi-3.5-V, LLaVA-1.6, Qwen2-VL, LoRA, and GradCache, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes The temperature for the loss function is set to 0.02, with a batch size of 1,024, a maximum text length of 256 tokens, and 2K training steps. The LoRA variant uses a rank of 8. For VLM2VEC leveraging Phi-3.5-V as the backbone, we configure the number of sub-image crops to 4. For VLM2VEC using LLaVA-1.6 and Qwen2-VL as the backbone, we resize the input images to a uniform resolution, employing two setups: a high-resolution configuration of 1344×1344 and a low-resolution configuration of 336×336.
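The temperature-scaled contrastive objective quoted above (temperature 0.02, batch size 1,024 with in-batch negatives) matches the standard InfoNCE form; the paper's exact formulation is in its Section 3.1. Below is a minimal PyTorch sketch under that assumption; the function name `info_nce_loss` and the toy batch are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, target_emb, temperature=0.02):
    """InfoNCE contrastive loss with in-batch negatives.

    query_emb, target_emb: (batch, dim) L2-normalized embeddings;
    each query's positive target sits at the same row index, and
    every other row in the batch serves as a negative.
    """
    # (batch, batch) cosine-similarity matrix, sharpened by the temperature.
    logits = query_emb @ target_emb.T / temperature
    # Diagonal entries are the positive pairs.
    labels = torch.arange(query_emb.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 4 normalized 8-dim embeddings, with targets
# that are near-duplicates of the queries standing in for positives.
q = F.normalize(torch.randn(4, 8), dim=-1)
t = F.normalize(q + 0.01 * torch.randn(4, 8), dim=-1)
loss = info_nce_loss(q, t)
```

A low temperature such as 0.02 sharpens the softmax over candidates, so small similarity gaps between the positive and the in-batch negatives translate into large gradient signal, which is a common choice for embedding models trained with large batches.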