Exploring the Design Space of Visual Context Representation in Video MLLMs

Authors: Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes. First, we formulate the task of visual context representation as a constrained optimization problem, modeling the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. We then explore the scaling effects of frame selection and token selection respectively, fitting the corresponding function curves through extensive empirical experiments. We examine the effectiveness of typical selection strategies and present empirical findings for determining the two factors.
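The constrained optimization described above can be sketched in code: given a visual context budget C, choose the number of frames T and tokens per frame M, subject to T × M ≤ C, so as to minimize a fitted loss curve. The separable power-law form and all coefficients below are illustrative assumptions for the sketch, not the paper's fitted values.

```python
def loss(T, M, a=1.2, alpha=0.35, b=0.9, beta=0.28, c=2.0):
    """Assumed loss model: more frames (T) and more tokens per frame (M)
    each reduce the language-modeling loss with diminishing returns.
    The power-law-plus-constant form and coefficients are placeholders."""
    return a * T ** (-alpha) + b * M ** (-beta) + c


def best_allocation(C):
    """Enumerate allocations of the visual context budget C and return the
    (loss, T, M) triple minimizing the modeled loss under T * M <= C."""
    best = None
    for T in range(1, C + 1):
        M = C // T  # spend the remaining budget on tokens per frame
        cand = (loss(T, M), T, M)
        if best is None or cand < best:
            best = cand
    return best


# Example: split a budget of 1024 visual tokens between frames and
# tokens per frame; the optimum lies strictly between the two extremes.
modeled_loss, n_frames, tokens_per_frame = best_allocation(1024)
```

Under this toy model the optimal allocation balances the two power-law terms; the paper instead fits the curves empirically and compares selection strategies for each factor.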
Researcher Affiliation: Collaboration. (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Baichuan Inc.; (3) School of Information, Renmin University of China; (4) Institute of Automation, Chinese Academy of Sciences.
Pseudocode: No. The paper describes its methodology through textual explanations and mathematical formulations, but it does not contain any explicitly labeled pseudocode blocks or algorithm listings with structured, code-like steps.
Open Source Code: Yes. The data and code are available at: https://github.com/RUCAIBox/Opt-Visor.
Open Datasets: Yes. For the image instruction set, we adopt Cauldron (Laurençon et al., 2024b)... For the video instruction set, we collect the instructions from VideoChatGPT-100K (Maaz & Khan, 2023), ShareGPT4Video (Chen et al., 2024), ShareGPTVideo (Zhang et al., 2024b), VIM (Du et al., 2024), as well as some instruction data from VideoChat2 (Li et al., 2024b).
Dataset Splits: No. The paper mentions mixing several instruction datasets to construct a new instruction dataset (Table 6). It also states that "each model is trained for one epoch to ensure that training samples are unseen when calculating the loss for evaluation" and that "The zero-shot accuracy can reflect the performance of the model in the real-world application. We select several long video understanding benchmarks for evaluation, including Event-Bench (Du et al., 2024) (only with the challenging episodic reasoning task), VNBench (Zhao et al., 2024), MLVU (Zhou et al., 2024), and Video-MME (Fu et al., 2024)." While these indicate a division between training and evaluation, the specific split percentages or sample counts for the combined new instruction dataset (train/validation/test) are not explicitly provided.
Hardware Specification: Yes. All the experiments are conducted on 32 Nvidia H800 GPUs.
Software Dependencies: No. The paper mentions using specific models such as "SigLIP (Zhai et al., 2023) as the image encoder, Qwen2-7B (Yang et al., 2024) as the base LLM", but it does not provide explicit version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup: Yes. All the experiments are conducted on 32 Nvidia H800 GPUs, with the detailed hyperparameters listed in Table 7. Table 7 (Training hyperparameters): Global batch size 64; Gradient clipping 1; Weight decay 0; Warmup ratio 0.03; LLM lr 2e-5; Projector lr 1e-4; Vision encoder lr 2e-6; lr schedule cosine.
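For reference, Table 7's values can be collected into a single configuration object. The field names below are illustrative (the paper lists only the hyperparameter names and values, not a config schema):

```python
# Sketch: Table 7's training hyperparameters as a plain config dict.
# Key names are assumptions; the values are taken directly from Table 7.
train_config = {
    "global_batch_size": 64,
    "gradient_clipping": 1.0,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
    "llm_lr": 2e-5,            # learning rate for the base LLM
    "projector_lr": 1e-4,      # learning rate for the vision-language projector
    "vision_encoder_lr": 2e-6, # learning rate for the image encoder
    "lr_schedule": "cosine",
}
```

Note the three-way learning-rate split: the randomly initialized projector trains fastest, while the pretrained vision encoder receives the smallest rate.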