Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Authors: Qirui Chen, Shangzhe Di, Weidi Xie

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Experimental results reveal that existing multimodal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction-tuning data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a baseline for this new task. Furthermore, when trained on third-person-view videos, the same architecture also achieves state-of-the-art performance on the single-hop VideoQA benchmark ActivityNet-RTL, demonstrating its effectiveness.
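The grounding-token mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the cosine-similarity retrieval, and the thresholding heuristic are all assumptions for illustration; the paper only states that hidden states of generated grounding tokens are used by a grounding module (hidden size 1024, versus 4096 for the LLM, so a projection between the two spaces would be needed in practice).

```python
import torch

def retrieve_spans(grounding_hidden, frame_feats, threshold=0.5):
    """Illustrative sketch: treat the hidden states of generated grounding
    tokens as queries over per-frame video features, then group frames whose
    relevance exceeds a threshold into temporal [start, end] spans.

    grounding_hidden: (num_tokens, d) hidden states at grounding-token positions
    frame_feats:      (num_frames, d) per-frame video features (same dim assumed)
    """
    # cosine similarity between every grounding token and every frame
    q = torch.nn.functional.normalize(grounding_hidden, dim=-1)
    k = torch.nn.functional.normalize(frame_feats, dim=-1)
    relevance = (q @ k.T).max(dim=0).values  # (num_frames,)

    # group consecutive above-threshold frames into contiguous spans
    spans, start = [], None
    for i, r in enumerate(relevance.tolist()):
        if r >= threshold and start is None:
            start = i
        elif r < threshold and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(relevance) - 1))
    return spans
```

Because each question may require several grounding tokens, taking the per-frame maximum over tokens lets evidence scattered across the video be recovered as multiple disjoint spans, which matches the multi-hop setting.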
Researcher Affiliation Academia Qirui Chen1,2, Shangzhe Di1,2, Weidi Xie1* 1School of Artificial Intelligence, Shanghai Jiao Tong University, China 2Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China
Pseudocode No The paper describes methods in prose and through mathematical formulations and diagrams (e.g., Figure 3), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://qirui-chen.github.io/MultiHop-EgoQA
Open Datasets Yes Recently, the introduction of the Ego4D dataset (Grauman et al. 2022) has enabled a series of research in visual-language understanding in egocentric videos... To monitor the progress of this new task, we further curate a high-quality benchmark, MULTIHOP-EGOQA, with careful manual verification and refinement... We also evaluate our architecture on another public single-hop VideoQA benchmark, ActivityNet-RTL (Huang et al. 2024b), outperforming existing approaches by a large margin.
Dataset Splits No The paper states: "We randomly sample 10% of the test split and request participants to answer the questions and localise relevant time spans." and "We utilize the triplets generated in our automated pipeline to train the multi-modal LLM and the grounding module. These triplets have been filtered by the LLM, but not manually refined in Stage IV, consisting of 3,156 clips with a total of 10,414 samples." However, it does not explicitly provide the train/validation/test splits (e.g., percentages or exact counts) for the MULTIHOP-EGOQA benchmark or for ActivityNet-RTL in the main text.
Hardware Specification Yes The experiments are conducted using 4 NVIDIA H800 (80GB) GPUs, with a batch size of 32 per device.
Software Dependencies Yes The large language model employed is Vicuna-7B v1.3 (Chiang et al. 2023).
Experiment Setup Yes The large language model employed is Vicuna-7B v1.3 (Chiang et al. 2023). The dimensions of the hidden states for the LLM and the grounding module are 4096 and 1024, respectively. The experiments are conducted using 4 NVIDIA H800 (80GB) GPUs, with a batch size of 32 per device. The model is trained for 10 epochs with a learning rate of 2 × 10⁻⁵, employing a warmup cosine decay strategy.
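The warmup cosine decay strategy mentioned in the setup can be sketched as follows. The peak learning rate of 2 × 10⁻⁵ comes from the paper; the warmup length and the decay-to-zero floor are assumptions, since the paper does not state them.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Linear warmup followed by cosine decay to zero.

    base_lr is the paper's reported peak learning rate (2e-5);
    warmup_steps is illustrative, as the warmup length is not reported.
    """
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice such a schedule is usually attached to the optimizer via a per-step callback (e.g., a LambdaLR-style wrapper), with `total_steps` computed from the 10 training epochs and the effective batch size of 4 GPUs × 32 samples per device.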