Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Authors: Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. ... StreamChat sets new benchmarks (cf. Fig. 1), delivering a 64.7% accuracy on StreamBench for online settings, which is an 8.3% improvement over the previous best. In offline scenarios, it outperforms the state-of-the-art method by an average of 2.5% across four public benchmarks. |
| Researcher Affiliation | Academia | Haomiao Xiong1, Zongxin Yang2, Jiazuo Yu1, Yunzhi Zhuge1, Lu Zhang1, Jiawen Zhu1, Huchuan Lu1. 1Dalian University of Technology, 2Harvard University. Corresponding author (EMAIL) |
| Pseudocode | Yes | Algorithm 1 Knowledge Retrieval from Long-Term Memory |
| Open Source Code | Yes | Code is available at StreamChat. |
| Open Datasets | Yes | Our primary sources are the EgoSchema [13] and YouTube-8M [19] datasets. ... We evaluate StreamChat on existing benchmarks [12–16]... MSRVTT [12], ActivityNet [14], NExT-QA [15], MSVD-QA [16]. |
| Dataset Splits | No | The paper describes the StreamBench dataset (306 videos, 24.8 hours total) and mentions using other benchmarks like MSVD, MSRVTT, ActivityNet, and NExT-QA. However, it does not specify explicit training, validation, or test splits (e.g., percentages or sample counts) for any of these datasets in the main text. |
| Hardware Specification | Yes | Experiments were conducted on two NVIDIA Tesla A800 GPUs with 80GB of memory each (more details in Appen. F). |
| Software Dependencies | No | The paper mentions using specific models and algorithms such as CLIP-L-P14 [26], the LLaMA-3 model [3], the Lucas-Kanade Optical Flow algorithm [21], MiniLM-L6 [24] as the encoder model, and a FAISS [25] index. However, it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | Memory Configurations. To adapt the model to various application scenarios, we configure three versions with different memory settings: Base, Fast, and Slow. These variants adjust key memory parameters, including threshold (t), chunk length (L), group size (g), and clustering goals (C), as summarized in Tab. 3. ... We utilize CLIP-L-P14 [26] as the vision encoder and we set the number of selected memory units S to 5 and candidate length C to 20. |
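The retrieval step quoted in the pseudocode row (Algorithm 1, Knowledge Retrieval from Long-Term Memory) amounts to embedding-based top-k selection over stored memory units. A minimal NumPy sketch, assuming memory units are stored as text embeddings: the paper indexes MiniLM-L6 embeddings with FAISS, while plain cosine similarity stands in for FAISS here, and `retrieve_top_k` is a hypothetical name, not the paper's API:

```python
import numpy as np

def retrieve_top_k(query_emb, memory_embs, k=5):
    """Return indices and cosine similarities of the k memory units closest to the query."""
    # Normalize the query and each stored memory unit to unit length.
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    # Cosine similarity is then a single matrix-vector product.
    sims = m @ q
    # Take the k highest-scoring units.
    topk = np.argsort(-sims)[:k]
    return topk, sims[topk]
```

With the setup row's values (S = 5 selected units, candidate length C = 20), this would be called as `retrieve_top_k(query, candidates, k=5)` over a 20-unit candidate set.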
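The three memory presets (Base, Fast, Slow) vary four parameters: threshold (t), chunk length (L), group size (g), and clustering goals (C). A sketch of how such a configuration could be represented, assuming a simple dataclass; the concrete preset values live in the paper's Tab. 3 and are not reproduced here, and the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    threshold: float      # t (values per preset given in Tab. 3)
    chunk_length: int     # L
    group_size: int       # g
    clustering_goal: int  # C

# Hypothetical values for illustration only, not the paper's settings.
example = MemoryConfig(threshold=0.5, chunk_length=16, group_size=4, clustering_goal=8)
```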