Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge

Authors: Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. ... StreamChat sets new benchmarks (cf. Fig. 1), delivering a 64.7% accuracy on StreamBench for online settings, which is an 8.3% improvement over the previous best. In offline scenarios, it outperforms the state-of-the-art method by an average of 2.5% across four public benchmarks.
Researcher Affiliation | Academia | Haomiao Xiong1, Zongxin Yang2, Jiazuo Yu1, Yunzhi Zhuge1, Lu Zhang1, Jiawen Zhu1, Huchuan Lu1; 1Dalian University of Technology, 2Harvard University. Corresponding author (EMAIL)
Pseudocode | Yes | Algorithm 1: Knowledge Retrieval from Long-Term Memory
Open Source Code | No | Code is available at StreamChat.
Open Datasets | Yes | Our primary sources are the EgoSchema [13] and YouTube-8M [19] datasets. ... We evaluate StreamChat on existing benchmarks [12-16]: MSRVTT [12], ActivityNet [14], NExT-QA [15], MSVD-QA [16].
Dataset Splits | No | The paper describes the StreamBench dataset (306 videos, 24.8 hours total) and mentions using other benchmarks such as MSVD, MSRVTT, ActivityNet, and NExT-QA. However, it does not specify explicit training, validation, or test splits (e.g., percentages or sample counts) for any of these datasets in the main text.
Hardware Specification | Yes | Experiments were conducted on two NVIDIA Tesla A800 GPUs with 80 GB of memory each (more details in Appendix F).
Software Dependencies | No | The paper mentions specific models and algorithms, such as CLIP-L-P14 [26], the LLaMA-3 model [3], the Lucas-Kanade optical flow algorithm [21], MiniLM-L6 [24] as the encoder model, and a FAISS [25] index, but it does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | Memory Configurations. To adapt the model to various application scenarios, we configure three versions with different memory settings: Base, Fast, and Slow. These variants adjust key memory parameters, including threshold (t), chunk length (L), group size (g), and clustering goals (C), as summarized in Tab. 3. ... We utilize CLIP-L-P14 [26] as the vision encoder and we set the number of selected memory units S to 5 and candidate length C to 20.
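The Software Dependencies row flags the absence of version numbers. For context, a version-pinned dependency list is the kind of artifact that would resolve this; the entries below are illustrative placeholders mapping to the components the paper names (FAISS, LLaMA-3, MiniLM-L6, Lucas-Kanade via OpenCV), and none of the version numbers come from the paper.

```
# Illustrative pinned environment; versions are placeholders, NOT from the paper.
torch==2.1.0
transformers==4.40.0          # LLaMA-3, CLIP
sentence-transformers==2.7.0  # MiniLM-L6 encoder
faiss-gpu==1.7.4              # long-term memory index
opencv-python==4.9.0.80       # Lucas-Kanade optical flow
```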
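The Pseudocode row refers to the paper's Algorithm 1 (knowledge retrieval from long-term memory), which is not reproduced in this review. As a rough illustration of what such retrieval typically looks like, the sketch below selects the S most similar stored memory units for a query embedding via cosine similarity; the function name and scoring details are our assumptions, not the authors' algorithm.

```python
import numpy as np

def retrieve_from_memory(query_emb, memory_embs, memory_units, top_s=5):
    """Return the top_s memory units most similar to the query embedding.

    Simplified sketch of similarity-based retrieval; the paper's
    Algorithm 1 may use different scoring and candidate filtering.
    """
    # Cosine similarity between the query and each stored memory embedding.
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    scores = m @ q
    # Keep the S highest-scoring units (the paper sets S = 5).
    top_idx = np.argsort(scores)[::-1][:top_s]
    return [memory_units[i] for i in top_idx], scores[top_idx]
```

In the paper's setup, the stored embeddings would come from the MiniLM-L6 encoder and the search would run over a FAISS index rather than a dense NumPy matrix.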
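The Experiment Setup row lists the memory parameters varied across the Base/Fast/Slow variants. A minimal sketch of how those knobs could be grouped is below; only S = 5 and candidate length 20 are stated in the quoted text, so the remaining values are placeholders (Tab. 3, which holds the real per-variant values, is not reproduced here). Note the paper overloads C for both clustering goals and candidate length, so the fields are named separately.

```python
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    """Memory parameters varied across the Base/Fast/Slow variants (Tab. 3)."""
    threshold: float          # t: memory threshold
    chunk_length: int         # L: chunk length
    group_size: int           # g: group size
    clustering_goal: int      # C: clustering goal
    selected_units: int = 5   # S, stated in the quoted setup
    candidate_length: int = 20  # candidate length C, stated in the quoted setup

# Placeholder "Base" variant; t/L/g/clustering values are illustrative only.
base = MemoryConfig(threshold=0.5, chunk_length=16, group_size=4, clustering_goal=8)
```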