Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge

Authors: Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. ... StreamChat sets new benchmarks (cf. Fig. 1), delivering a 64.7% accuracy on StreamBench for online settings, which is an 8.3% improvement over the previous best. In offline scenarios, it outperforms the state-of-the-art method by an average of 2.5% across four public benchmarks.
Researcher Affiliation | Academia | Haomiao Xiong1, Zongxin Yang2, Jiazuo Yu1, Yunzhi Zhuge1, Lu Zhang1, Jiawen Zhu1, Huchuan Lu1; 1Dalian University of Technology, 2Harvard University. Corresponding author (EMAIL)
Pseudocode | Yes | Algorithm 1: Knowledge Retrieval from Long-Term Memory
Open Source Code | No | Code is available at StreamChat.
Open Datasets | Yes | Our primary sources are the EgoSchema [13] and YouTube-8M [19] datasets. ... We evaluate StreamChat on existing benchmarks [12-16]: MSRVTT [12], ActivityNet [14], NExT-QA [15], MSVD-QA [16].
Dataset Splits | No | The paper describes the StreamBench dataset (306 videos, 24.8 hours total) and mentions using other benchmarks such as MSVD, MSRVTT, ActivityNet, and NExT-QA. However, it does not specify explicit training, validation, or test splits (e.g., percentages or sample counts) for any of these datasets in the main text.
Hardware Specification | Yes | Experiments were conducted on two NVIDIA Tesla A800 GPUs with 80 GB of memory each (more details in Appendix F).
Software Dependencies | No | The paper mentions specific models and algorithms, such as CLIP-L-P14 [26], the LLaMA-3 model [3], the Lucas-Kanade optical flow algorithm [21], MiniLM-L6 [24] as the encoder model, and a FAISS [25] index, but it does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | Memory Configurations. To adapt the model to various application scenarios, we configure three versions with different memory settings: Base, Fast, and Slow. These variants adjust key memory parameters, including threshold (t), chunk length (L), group size (g), and clustering goals (C), as summarized in Tab. 3. ... We utilize CLIP-L-P14 [26] as the vision encoder and we set the number of selected memory units S to 5 and candidate length C to 20.
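The Software Dependencies row flags the absence of version numbers. For context, a version-pinned dependency list is the kind of artifact that would resolve this; the entries below are illustrative placeholders mapping to the components the paper names (FAISS, LLaMA-3, MiniLM-L6, Lucas-Kanade via OpenCV), and none of the version numbers come from the paper.

```
# Illustrative pinned environment; versions are placeholders, NOT from the paper.
torch==2.1.0
transformers==4.40.0          # LLaMA-3, CLIP
sentence-transformers==2.7.0  # MiniLM-L6 encoder
faiss-gpu==1.7.4              # long-term memory index
opencv-python==4.9.0.80       # Lucas-Kanade optical flow
```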
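The Pseudocode row refers to the paper's Algorithm 1 (knowledge retrieval from long-term memory), which is not reproduced in this review. As a rough illustration of what such retrieval typically looks like, the sketch below selects the S most similar stored memory units for a query embedding via cosine similarity; the function name and scoring details are our assumptions, not the authors' algorithm.

```python
import numpy as np

def retrieve_from_memory(query_emb, memory_embs, memory_units, top_s=5):
    """Return the top_s memory units most similar to the query embedding.

    Simplified sketch of similarity-based retrieval; the paper's
    Algorithm 1 may use different scoring and candidate filtering.
    """
    # Cosine similarity between the query and each stored memory embedding.
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    scores = m @ q
    # Keep the S highest-scoring units (the paper sets S = 5).
    top_idx = np.argsort(scores)[::-1][:top_s]
    return [memory_units[i] for i in top_idx], scores[top_idx]
```

In the paper's setup, the stored embeddings would come from the MiniLM-L6 encoder and the search would run over a FAISS index rather than a dense NumPy matrix.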
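The Experiment Setup row lists the memory parameters varied across the Base/Fast/Slow variants. A minimal sketch of how those knobs could be grouped is below; only S = 5 and candidate length 20 are stated in the quoted text, so the remaining values are placeholders (Tab. 3, which holds the real per-variant values, is not reproduced here). Note the paper overloads C for both clustering goals and candidate length, so the fields are named separately.

```python
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    """Memory parameters varied across the Base/Fast/Slow variants (Tab. 3)."""
    threshold: float          # t: memory threshold
    chunk_length: int         # L: chunk length
    group_size: int           # g: group size
    clustering_goal: int      # C: clustering goal
    selected_units: int = 5   # S, stated in the quoted setup
    candidate_length: int = 20  # candidate length C, stated in the quoted setup

# Placeholder "Base" variant; t/L/g/clustering values are illustrative only.
base = MemoryConfig(threshold=0.5, chunk_length=16, group_size=4, clustering_goal=8)
```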