Unifying Specialized Visual Encoders for Video Language Models
Authors: Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Princeton University, Princeton, NJ, United States 2Salesforce Research, Palo Alto, CA, United States. Correspondence to: Jihoon Chung, Tyler Zhu <EMAIL>. |
| Pseudocode | No | The paper describes its methodology using textual descriptions and mathematical equations (e.g., equations 1 and 2 in section 3.2), but it does not contain explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Our code and pretrained weights are available at https://github.com/princetonvisualai/merv. |
| Open Datasets | Yes | Our data mix is the same as Video-LLaVA (Lin et al., 2024). The Stage 1 data is single-turn concise captioning, with 558k (image, text) pairs from LAION filtered by LLaVA (Liu et al., 2023) and 702k (video, text) pairs from Valley (Luo et al., 2023). The Stage 2 data is multi-turn conversations, detailed captioning and reasoning, with 665k (image, text) pairs from LLaVA and 100k (video, text) instructions from Video-ChatGPT (Maaz et al., 2024). We evaluate our model on a comprehensive suite of video understanding benchmarks, including the open-ended MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017), TGIF (Jang et al., 2017), and ActivityNet-QA (Yu et al., 2019), as well as the multiple-choice benchmarks NExT-QA (Xiao et al., 2021), VLEP (Lei et al., 2020), TVQA (Lei et al., 2018), and Perception Test (Pătrăucean et al., 2023). |
| Dataset Splits | Yes | For fair comparison, our data mix is the same as Video-LLaVA (Lin et al., 2024). The Stage 1 data is single-turn concise captioning, with 558k (image, text) pairs from LAION filtered by LLaVA (Liu et al., 2023) and 702k (video, text) pairs from Valley (Luo et al., 2023). The Stage 2 data is multi-turn conversations, detailed captioning and reasoning, with 665k (image, text) pairs from LLaVA and 100k (video, text) instructions from Video-ChatGPT (Maaz et al., 2024). We emphasize that the NExT-QA, VLEP, and TVQA datasets are held-out datasets that we did not use during our experiments, and only evaluated once after all design decisions were finalized. |
| Hardware Specification | Yes | Our training is efficient for using multiple visual models, completing in under 24 hours using 8 L40-48GB GPUs, and down to 8 hours using 8 H100s. |
| Software Dependencies | No | Our code is built on top of the Prismatic VLM codebase (Karamcheti et al., 2024), which efficiently implements vision-language model (VLM) training. We add the ability to handle videos and an arbitrary number of visual encoders, along with many useful features for training. ... We use PyTorch's Fully Sharded Data Parallel (Zhao et al., 2023). ... We use the base LLaMA-2 7B model (Touvron et al., 2023b). |
| Experiment Setup | Yes | For MERV (frozen), we train on only Stage 2 data for 1 epoch with a learning rate of 2×10⁻⁵ and a batch size of 128 with gradient accumulation. For MERV (full), we first train on Stage 1 data with a learning rate of 1×10⁻⁴ and the projectors, feature fusion, and LLM unfrozen with similar settings. Both recipes use an initial warmup ratio of 0.03 and a cosine schedule. |
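The learning-rate recipe in the Experiment Setup row (linear warmup over 3% of steps, then cosine decay) can be sketched as a small standalone function. This is a generic illustration under the stated hyperparameters, not the authors' actual implementation; the function name `lr_at_step` and the 1000-step horizon are hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Cosine learning-rate schedule with linear warmup.

    Defaults follow the reported recipe (warmup ratio 0.03); pass
    base_lr=2e-5 for the frozen recipe or 1e-4 for Stage 1 of the
    full recipe.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical 1000-step run with the frozen-recipe learning rate 2e-5:
# lr ramps linearly for the first 30 steps, peaks at 2e-5, then decays.
for s in (0, 30, 500, 1000):
    print(s, lr_at_step(s, 1000, 2e-5))
```

In practice the same shape is what PyTorch's `LambdaLR` (or a Hugging Face `get_cosine_schedule_with_warmup`) would produce; the closed-form version above just makes the warmup-ratio and cosine terms explicit.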