Exploring the Design Space of Visual Context Representation in Video MLLMs

Authors: Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes. First, we formulate the task of visual context representation as a constrained optimization problem, modeling the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. We then explore the scaling effects of frame selection and token selection respectively, fitting the corresponding function curves through extensive empirical experiments. We examine the effectiveness of typical selection strategies and present empirical findings for determining the two factors.
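The constrained optimization described above can be sketched in code: given a visual context budget C, choose the number of frames T and tokens per frame M, subject to T × M ≤ C, so as to minimize a fitted loss curve. The separable power-law form and all coefficients below are illustrative assumptions for the sketch, not the paper's fitted values.

```python
def loss(T, M, a=1.2, alpha=0.35, b=0.9, beta=0.28, c=2.0):
    """Assumed loss model: more frames (T) and more tokens per frame (M)
    each reduce the language-modeling loss with diminishing returns.
    The power-law-plus-constant form and coefficients are placeholders."""
    return a * T ** (-alpha) + b * M ** (-beta) + c


def best_allocation(C):
    """Enumerate allocations of the visual context budget C and return the
    (loss, T, M) triple minimizing the modeled loss under T * M <= C."""
    best = None
    for T in range(1, C + 1):
        M = C // T  # spend the remaining budget on tokens per frame
        cand = (loss(T, M), T, M)
        if best is None or cand < best:
            best = cand
    return best


# Example: split a budget of 1024 visual tokens between frames and
# tokens per frame; the optimum lies strictly between the two extremes.
modeled_loss, n_frames, tokens_per_frame = best_allocation(1024)
```

Under this toy model the optimal allocation balances the two power-law terms; the paper instead fits the curves empirically and compares selection strategies for each factor.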
Researcher Affiliation: Collaboration. (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Baichuan Inc.; (3) School of Information, Renmin University of China; (4) Institute of Automation, Chinese Academy of Sciences.
Pseudocode: No. The paper describes its methodology through textual explanations and mathematical formulations, but it does not contain any explicitly labeled pseudocode blocks or algorithm listings with structured, code-like steps.
Open Source Code: Yes. The data and code are available at: https://github.com/RUCAIBox/Opt-Visor.
Open Datasets: Yes. For the image instruction set, we adopt Cauldron (Laurençon et al., 2024b)... For the video instruction set, we collect the instructions from VideoChatGPT-100K (Maaz & Khan, 2023), ShareGPT4Video (Chen et al., 2024), ShareGPTVideo (Zhang et al., 2024b), VIM (Du et al., 2024), as well as some instruction data from VideoChat2 (Li et al., 2024b).
Dataset Splits: No. The paper mentions mixing several instruction datasets to construct a new instruction dataset (Table 6). It also states that "each model is trained for one epoch to ensure that training samples are unseen when calculating the loss for evaluation" and that "The zero-shot accuracy can reflect the performance of the model in the real-world application. We select several long video understanding benchmarks for evaluation, including Event-Bench (Du et al., 2024) (only with the challenging episodic reasoning task), VNBench (Zhao et al., 2024), MLVU (Zhou et al., 2024), and Video-MME (Fu et al., 2024)." While these indicate a division between training and evaluation, the specific split percentages or sample counts for the combined new instruction dataset (train/validation/test) are not explicitly provided.
Hardware Specification: Yes. All the experiments are conducted on 32 Nvidia H800 GPUs.
Software Dependencies: No. The paper mentions using specific models such as "SigLIP (Zhai et al., 2023) as the image encoder, Qwen2-7B (Yang et al., 2024) as the base LLM", but it does not provide explicit version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup: Yes. All the experiments are conducted on 32 Nvidia H800 GPUs, with the detailed hyperparameters listed in Table 7. Table 7 (Training hyperparameters): Global batch size 64; Gradient clipping 1; Weight decay 0; Warmup ratio 0.03; LLM lr 2e-5; Projector lr 1e-4; Vision encoder lr 2e-6; lr schedule cosine.
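For reference, Table 7's values can be collected into a single configuration object. The field names below are illustrative (the paper lists only the hyperparameter names and values, not a config schema):

```python
# Sketch: Table 7's training hyperparameters as a plain config dict.
# Key names are assumptions; the values are taken directly from Table 7.
train_config = {
    "global_batch_size": 64,
    "gradient_clipping": 1.0,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
    "llm_lr": 2e-5,            # learning rate for the base LLM
    "projector_lr": 1e-4,      # learning rate for the vision-language projector
    "vision_encoder_lr": 2e-6, # learning rate for the image encoder
    "lr_schedule": "cosine",
}
```

Note the three-way learning-rate split: the randomly initialized projector trains fastest, while the pretrained vision encoder receives the smallest rate.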