Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Authors: Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing Video QA models. In Section 4, we present ablation studies and comparisons to validate our approach. Figure 1: ... Tested with LLaVA-OV-7B on an H800 (80GB) GPU, ReKV maintains stable latency and GPU memory usage, preventing out-of-memory (OOM) errors as frames increase. It also improves the accuracy on seven long-form Video QA benchmarks compared to the uniform sampling baseline. |
| Researcher Affiliation | Collaboration | Shangzhe Di (1,2), Zhelun Yu (2), Guanghao Zhang (2), Haoyuan Li (2), Tao Zhong (2), Hao Cheng (2), Bolin Li (2), Wanggui He (2), Fangxun Shu (2), Hao Jiang (2). Affiliations: 1 Shanghai Jiao Tong University; 2 Alibaba Group. |
| Pseudocode | No | The paper describes the proposed method conceptually and through mathematical formulations (Equations 1, 2, 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide any links to code repositories. |
| Open Datasets | Yes | MLVU-dev-mc (Zhou et al., 2024a) is the multiple-choice subset of the MLVU-dev benchmark. QAEGO4D-test-mc (Di & Xie, 2024) is the multiple-choice subset of the QAEGO4D-test benchmark, focusing on question-answering in long egocentric videos. EgoSchema (Mangalam et al., 2023) is a diagnostic benchmark for long Video QA, featuring over 5,000 multiple-choice questions and long temporal certificate length. ActivityNet-QA (Yu et al., 2019) encompasses human-annotated QA pairs on 5,800 videos derived from the ActivityNet (Caba Heilbron et al., 2015) dataset. RVS-Ego and RVS-Movie (Zhang et al., 2024a) are Streaming Video QA benchmarks, constructed using 10 long videos from the Ego4D dataset (Grauman et al., 2022) and 22 long videos from the MovieNet dataset (Huang et al., 2020), respectively. CGBench-mc (Chen et al., 2025a), the multiple-choice subset of CGBench, is designed for clue-grounded question answering in long videos. |
| Dataset Splits | No | The paper references specific subsets of established benchmarks, such as 'MLVU-dev-mc' and 'QAEGO4D-test-mc', which implies predefined splits are used for evaluation. However, it does not explicitly state the dataset split percentages, sample counts, or the methodology for creating these splits, which would be necessary for reproduction beyond using the designated benchmark subsets. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 (80GB) GPUs with FP16 precision. Tested with LLaVA-OV-7B on an H800 (80GB) GPU, ReKV maintains stable latency and GPU memory usage, preventing out-of-memory (OOM) errors as frames increase. |
| Software Dependencies | No | The paper mentions using specific models like LLaVA-OV-0.5B and LLaVA-OV-7B and a retriever like SigLIP-SO400M, and refers to FP16 precision. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For video modeling, we process the video stream at 0.5 FPS, in line with GPT-4o's testing on MLVU (Zhou et al., 2024a). The local window size is set to 15K. For external video KV-Cache retrieval, we use SigLIP-SO400M (Zhai et al., 2023) as the retriever. For internal KV-Cache retrieval, we set the block size (b) to 1 and the number of retrieved frames (r) to 64 by default, with further hyperparameter variations explored in Section 4.3. |
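The Experiment Setup row above mentions two retrieval hyperparameters: a block size `b` (frames per retrieval unit, default 1) and a retrieved-frame budget `r` (default 64). As the paper provides no pseudocode, the following is a minimal sketch of how such block-wise KV-cache retrieval could work; the function name, the cosine-style dot-product scoring, and the data layout are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def retrieve_kv_blocks(query_emb, frame_embs, kv_cache, block_size=1, top_r=64):
    """Hypothetical sketch: pick the KV-cache entries of the most
    query-relevant frame blocks, returned in temporal order.

    query_emb  : (d,) embedding of the current question.
    frame_embs : (n_frames, d) one embedding per cached video frame.
    kv_cache   : list of n_frames per-frame KV entries (opaque objects).
    block_size : frames grouped per retrieval unit (paper's b, default 1).
    top_r      : frame budget to retrieve (paper's r, default 64).
    """
    n = len(kv_cache)
    sims = frame_embs @ query_emb                       # per-frame relevance scores
    n_blocks = (n + block_size - 1) // block_size
    # score each contiguous block by the mean score of its frames
    block_scores = np.array(
        [sims[i * block_size:(i + 1) * block_size].mean() for i in range(n_blocks)]
    )
    k = max(1, top_r // block_size)                     # blocks covering ~top_r frames
    chosen = sorted(np.argsort(block_scores)[::-1][:k]) # best k blocks, temporal order
    selected = []
    for blk in chosen:
        selected.extend(kv_cache[blk * block_size:(blk + 1) * block_size])
    return selected
```

With the paper's defaults (b = 1, r = 64), this degenerates to selecting the 64 individually most relevant frames; larger `b` trades retrieval granularity for fewer, more contiguous cache reads.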