Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Authors: Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing Video QA models. In Section 4, we present ablation studies and comparisons to validate our approach. Figure 1: ... Tested with LLaVA-OV-7B on an H800 (80GB) GPU, ReKV maintains stable latency and GPU memory usage, preventing out-of-memory (OOM) errors as frames increase. It also improves the accuracy on seven long-form Video QA benchmarks compared to the uniform sampling baseline. |
| Researcher Affiliation | Collaboration | Shangzhe Di (1,2), Zhelun Yu (2), Guanghao Zhang (2), Haoyuan Li (2), Tao Zhong (2), Hao Cheng (2), Bolin Li (2), Wanggui He (2), Fangxun Shu (2), Hao Jiang (2). Affiliations: 1 Shanghai Jiao Tong University; 2 Alibaba Group. |
| Pseudocode | No | The paper describes the proposed method conceptually and through mathematical formulations (Equations 1, 2, 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide any links to code repositories. |
| Open Datasets | Yes | MLVU-dev-mc (Zhou et al., 2024a) is the multiple-choice subset of the MLVU-dev benchmark. QAEGO4D-test-mc (Di & Xie, 2024) is the multiple-choice subset of the QAEGO4D-test benchmark, focusing on question-answering in long egocentric videos. EgoSchema (Mangalam et al., 2023) is a diagnostic benchmark for long Video QA, featuring over 5,000 multiple-choice questions and long temporal certificate length. ActivityNet-QA (Yu et al., 2019) encompasses human-annotated QA pairs on 5,800 videos derived from the ActivityNet (Caba Heilbron et al., 2015) dataset. RVS-Ego and RVS-Movie (Zhang et al., 2024a) are Streaming Video QA benchmarks, constructed using 10 long videos from the Ego4D dataset (Grauman et al., 2022) and 22 long videos from the MovieNet dataset (Huang et al., 2020), respectively. CGBench-mc (Chen et al., 2025a), the multiple-choice subset of CGBench, is designed for clue-grounded question answering in long videos. |
| Dataset Splits | No | The paper references specific subsets of established benchmarks, such as 'MLVU-dev-mc' and 'QAEGO4D-test-mc', which implies predefined splits are used for evaluation. However, it does not explicitly state the dataset split percentages, sample counts, or the methodology for creating these splits, which would be necessary for reproduction beyond using the designated benchmark subsets. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 (80GB) GPUs with FP16 precision. Tested with LLaVA-OV-7B on an H800 (80GB) GPU, ReKV maintains stable latency and GPU memory usage, preventing out-of-memory (OOM) errors as frames increase. |
| Software Dependencies | No | The paper mentions using specific models like LLaVA-OV-0.5B and LLaVA-OV-7B and a retriever like SigLIP-SO400M, and refers to FP16 precision. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For video modeling, we process the video stream at 0.5 FPS, in line with GPT-4o's testing on MLVU (Zhou et al., 2024a). The local window size is set to 15K. For external video KV-Cache retrieval, we use SigLIP-SO400M (Zhai et al., 2023) as the retriever. For internal KV-Cache retrieval, we set the block size (b) to 1 and the number of retrieved frames (r) to 64 by default, with further hyperparameter variations explored in Section 4.3. |
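The Experiment Setup row above mentions two retrieval hyperparameters: a block size `b` (frames per retrieval unit, default 1) and a retrieved-frame budget `r` (default 64). As the paper provides no pseudocode, the following is a minimal sketch of how such block-wise KV-cache retrieval could work; the function name, the cosine-style dot-product scoring, and the data layout are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def retrieve_kv_blocks(query_emb, frame_embs, kv_cache, block_size=1, top_r=64):
    """Hypothetical sketch: pick the KV-cache entries of the most
    query-relevant frame blocks, returned in temporal order.

    query_emb  : (d,) embedding of the current question.
    frame_embs : (n_frames, d) one embedding per cached video frame.
    kv_cache   : list of n_frames per-frame KV entries (opaque objects).
    block_size : frames grouped per retrieval unit (paper's b, default 1).
    top_r      : frame budget to retrieve (paper's r, default 64).
    """
    n = len(kv_cache)
    sims = frame_embs @ query_emb                       # per-frame relevance scores
    n_blocks = (n + block_size - 1) // block_size
    # score each contiguous block by the mean score of its frames
    block_scores = np.array(
        [sims[i * block_size:(i + 1) * block_size].mean() for i in range(n_blocks)]
    )
    k = max(1, top_r // block_size)                     # blocks covering ~top_r frames
    chosen = sorted(np.argsort(block_scores)[::-1][:k]) # best k blocks, temporal order
    selected = []
    for blk in chosen:
        selected.extend(kv_cache[blk * block_size:(blk + 1) * block_size])
    return selected
```

With the paper's defaults (b = 1, r = 64), this degenerates to selecting the 64 individually most relevant frames; larger `b` trades retrieval granularity for fewer, more contiguous cache reads.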