CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Authors: Guo Chen, Yicheng Liu, Yifei Huang, Baoqi Pei, Jilan Xu, Yuping He, Tong Lu, Yali Wang, Limin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate multiple closed-source and open-source MLLMs on CG-Bench. The results show that current models struggle significantly with long videos compared to short ones, and there is a notable gap between open-source and commercial models. |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai Artificial Intelligence Laboratory; The University of Tokyo; Zhejiang University; Fudan University |
| Pseudocode | No | The paper describes methods and evaluation processes in detail, but it does not contain any explicitly labeled pseudocode or algorithm blocks. Methodologies are explained in prose. |
| Open Source Code | No | All annotations and video data are available at https://cg-bench.github.io/leaderboard/. The paper names the various MLLMs it evaluates (e.g., LLaVA-Next-Video, LLaVA-OneVision), but it does not explicitly state that the source code for the methodology or evaluation framework developed in this paper (CG-Bench) is being released, nor does it provide a specific link to such code. |
| Open Datasets | Yes | All annotations and video data are available at https://cg-bench.github.io/leaderboard/. |
| Dataset Splits | No | The paper describes the creation of the CG-Bench dataset and states it is a 'held-out Video QA and question grounding benchmark'. While it mentions sampling subsets for specific analyses (e.g., '200 evaluation samples' for open-ended evaluation, '1000 QAC triplets sampled uniformly from all annotations for fast experiments'), it does not explicitly provide overall training, validation, and test dataset splits with percentages or sample counts for the CG-Bench itself. |
| Hardware Specification | No | For open-source MLLMs, we make the best use of our computational resources to use as many frames as possible. For closed-source MLLMs, since the local computational resource is no longer a bottleneck, we can use even more frames. The paper mentions 'computational resources' and 'hardware limitations' but does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions using specific multimodal large language models (MLLMs) like GPT-4o, Gemini-1.5 Pro, and Qwen2-VL for evaluation, and GPT-4 and Qwen2.5 for quality control, but it does not list the specific versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be needed to reproduce the experiment setup for CG-Bench itself. |
| Experiment Setup | Yes | We evaluate the performance of three mainstream commercial models on our CG-Bench: GPT-4o (OpenAI, 2024), Gemini-1.5 (Anil et al., 2023), and Claude-3.5, including their different versions. We also assess representative open-source image-MLLMs, such as LLaVA-OV (Li et al., 2024a), Qwen2-VL (Wang et al., 2024b), and InternVL2 (Chen et al., 2024e), as well as video-MLLMs, such as VideoChat2 (Li et al., 2023b). For long video understanding, the frame sampling strategy significantly impacts evaluation results. For open-source MLLMs, we make the best use of our computational resources to use as many frames as possible. For closed-source MLLMs, since local computational resources are no longer a bottleneck, we can use even more frames. We uniformly sample (Wang et al., 2019) 128 frames for Long-video MCQ and 32 frames for Clue-based MCQ. For subtitles, we employ a uniform sampling method: if the timestamp of a sampled frame falls within the time interval of a subtitle, that subtitle is included in the analysis. Each subtitle is considered only once to avoid redundancy. For MCQ tasks, the model is prompted to provide the uppercase letter corresponding to the correct option. In Open-Ended QA tasks, the model responds freely based on the questions. For the Clue Grounding task, we append the timestamps of each frame and subtitle to enhance the model's time-awareness, requiring it to return nested lists in the format [[s1, e1], [s2, e2], ...]. For open-ended evaluation, we require the model to assess the correctness between the predictions and the ground truth and respond with yes or no. |
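The frame-sampling and subtitle-inclusion procedure described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names (`uniform_sample_timestamps`, `select_subtitles`) and the subtitle representation as `(start, end, text)` tuples are assumptions for the sake of the example.

```python
def uniform_sample_timestamps(duration, num_frames=128):
    """Uniformly sample `num_frames` timestamps (in seconds) across a
    video of length `duration`, placing each at the center of its bin.
    The paper uses 128 frames for Long-video MCQ and 32 for Clue-based MCQ."""
    return [duration * (i + 0.5) / num_frames for i in range(num_frames)]

def select_subtitles(frame_times, subtitles):
    """Include a subtitle if any sampled frame's timestamp falls within its
    [start, end] interval; each subtitle is counted at most once, per the
    redundancy rule described in the paper. `subtitles` is a list of
    (start, end, text) tuples (an assumed representation)."""
    selected, seen = [], set()
    for t in frame_times:
        for idx, (start, end, text) in enumerate(subtitles):
            if idx not in seen and start <= t <= end:
                seen.add(idx)
                selected.append((start, end, text))
    return selected
```

For example, sampling 4 frames from a 12-second clip yields timestamps [1.5, 4.5, 7.5, 10.5]; a subtitle spanning [0, 2] is matched by the first frame and one spanning [10, 12] by the last, while a subtitle spanning [3, 4] is skipped because no sampled frame lands inside it.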