Unifying Specialized Visual Encoders for Video Language Models
Authors: Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Princeton University, Princeton, NJ, United States 2Salesforce Research, Palo Alto, CA, United States. Correspondence to: Jihoon Chung, Tyler Zhu <EMAIL>. |
| Pseudocode | No | The paper describes its methodology using textual descriptions and mathematical equations (e.g., equations 1 and 2 in section 3.2), but it does not contain explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Our code and pretrained weights are available at https://github.com/princetonvisualai/merv. |
| Open Datasets | Yes | Our data mix is the same as Video-LLaVA (Lin et al., 2024). The Stage 1 data is single-turn concise captioning, with 558k (image, text) pairs from LAION filtered by LLaVA (Liu et al., 2023) and 702k (video, text) pairs from Valley (Luo et al., 2023). The Stage 2 data is multi-turn conversations, detailed captioning and reasoning, with 665k (image, text) pairs from LLaVA and 100k (video, text) instructions from Video-ChatGPT (Maaz et al., 2024). We evaluate our model on a comprehensive suite of video understanding benchmarks, including the open-ended MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017), TGIF (Jang et al., 2017), and ActivityNet-QA (Yu et al., 2019), as well as the multiple-choice benchmarks NExT-QA (Xiao et al., 2021), VLEP (Lei et al., 2020), TVQA (Lei et al., 2018), and Perception Test (Pătrăucean et al., 2023). |
| Dataset Splits | Yes | For fair comparison, our data mix is the same as Video-LLaVA (Lin et al., 2024). The Stage 1 data is single-turn concise captioning, with 558k (image, text) pairs from LAION filtered by LLaVA (Liu et al., 2023) and 702k (video, text) pairs from Valley (Luo et al., 2023). The Stage 2 data is multi-turn conversations, detailed captioning and reasoning, with 665k (image, text) pairs from LLaVA and 100k (video, text) instructions from Video-ChatGPT (Maaz et al., 2024). We emphasize that the NExT-QA, VLEP, and TVQA datasets are held-out datasets that we did not use during our experiments, and only evaluated once after all design decisions were finalized. |
| Hardware Specification | Yes | Our training is efficient for using multiple visual models, completing in under 24 hours using 8 L40-48GB GPUs, and down to 8 hours using 8 H100s. |
| Software Dependencies | No | Our code is built on top of the Prismatic VLM codebase (Karamcheti et al., 2024), which efficiently implements vision-language model (VLM) training. We add the ability to handle videos and an arbitrary number of visual encoders, along with many useful features for training. ... We use PyTorch's Fully Sharded Data Parallel (Zhao et al., 2023). ... We use the base LLaMA-2 7B model (Touvron et al., 2023b). |
| Experiment Setup | Yes | For MERV (frozen), we train on only Stage 2 data for 1 epoch with a learning rate of 2×10⁻⁵ and a batch size of 128 with gradient accumulation. For MERV (full), we first train on Stage 1 data with a learning rate of 1×10⁻⁴ and the projectors, feature fusion, and LLM unfrozen with similar settings. Both recipes use an initial warmup ratio of 0.03 and a cosine schedule. |
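The learning-rate recipe in the Experiment Setup row (linear warmup over 3% of steps, then cosine decay) can be sketched as a small standalone function. This is a generic illustration under the stated hyperparameters, not the authors' actual implementation; the function name `lr_at_step` and the 1000-step horizon are hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Cosine learning-rate schedule with linear warmup.

    Defaults follow the reported recipe (warmup ratio 0.03); pass
    base_lr=2e-5 for the frozen recipe or 1e-4 for Stage 1 of the
    full recipe.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical 1000-step run with the frozen-recipe learning rate 2e-5:
# lr ramps linearly for the first 30 steps, peaks at 2e-5, then decays.
for s in (0, 30, 500, 1000):
    print(s, lr_at_step(s, 1000, 2e-5))
```

In practice the same shape is what PyTorch's `LambdaLR` (or a Hugging Face `get_cosine_schedule_with_warmup`) would produce; the closed-form version above just makes the warmup-ratio and cosine terms explicit.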