Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Authors: Qirui Chen, Shangzhe Di, Weidi Xie

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Experimental results reveal that existing multimodal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction-tuning data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a baseline for this new task. Furthermore, when trained on third-person-view videos, the same architecture also achieves state-of-the-art performance on the single-hop VideoQA benchmark ActivityNet-RTL, demonstrating its effectiveness.
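The grounding-token mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the cosine-similarity retrieval, and the thresholding heuristic are all assumptions for illustration; the paper only states that hidden states of generated grounding tokens are used by a grounding module (hidden size 1024, versus 4096 for the LLM, so a projection between the two spaces would be needed in practice).

```python
import torch

def retrieve_spans(grounding_hidden, frame_feats, threshold=0.5):
    """Illustrative sketch: treat the hidden states of generated grounding
    tokens as queries over per-frame video features, then group frames whose
    relevance exceeds a threshold into temporal [start, end] spans.

    grounding_hidden: (num_tokens, d) hidden states at grounding-token positions
    frame_feats:      (num_frames, d) per-frame video features (same dim assumed)
    """
    # cosine similarity between every grounding token and every frame
    q = torch.nn.functional.normalize(grounding_hidden, dim=-1)
    k = torch.nn.functional.normalize(frame_feats, dim=-1)
    relevance = (q @ k.T).max(dim=0).values  # (num_frames,)

    # group consecutive above-threshold frames into contiguous spans
    spans, start = [], None
    for i, r in enumerate(relevance.tolist()):
        if r >= threshold and start is None:
            start = i
        elif r < threshold and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(relevance) - 1))
    return spans
```

Because each question may require several grounding tokens, taking the per-frame maximum over tokens lets evidence scattered across the video be recovered as multiple disjoint spans, which matches the multi-hop setting.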
Researcher Affiliation Academia Qirui Chen1,2, Shangzhe Di1,2, Weidi Xie1* 1School of Artificial Intelligence, Shanghai Jiao Tong University, China 2Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China
Pseudocode No The paper describes methods in prose and through mathematical formulations and diagrams (e.g., Figure 3), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://qirui-chen.github.io/MultiHop-EgoQA
Open Datasets Yes Recently, the introduction of the Ego4D dataset (Grauman et al. 2022) has enabled a series of research in visual-language understanding in egocentric videos... To monitor the progress of this new task, we further curate a high-quality benchmark, MULTIHOP-EGOQA, with careful manual verification and refinement... We also evaluate our architecture on another public single-hop VideoQA benchmark, ActivityNet-RTL (Huang et al. 2024b), outperforming existing approaches by a large margin.
Dataset Splits No The paper states: "We randomly sample 10% of the test split and request participants to answer the questions and localise relevant time spans." and "We utilize the triplets generated in our automated pipeline to train the multi-modal LLM and the grounding module. These triplets have been filtered by the LLM, but not manually refined in Stage IV, consisting of 3,156 clips with a total of 10,414 samples." However, it does not explicitly provide the train/validation/test splits (e.g., percentages or exact counts) for the MULTIHOP-EGOQA benchmark or for ActivityNet-RTL in the main text.
Hardware Specification Yes The experiments are conducted using 4 NVIDIA H800 (80GB) GPUs, with a batch size of 32 per device.
Software Dependencies Yes The large language model employed is Vicuna-7B v1.3 (Chiang et al. 2023).
Experiment Setup Yes The large language model employed is Vicuna-7B v1.3 (Chiang et al. 2023). The dimensions of the hidden states for the LLM and the grounding module are 4096 and 1024, respectively. The experiments are conducted using 4 NVIDIA H800 (80GB) GPUs, with a batch size of 32 per device. The model is trained for 10 epochs with a learning rate of 2 × 10⁻⁵, employing a warmup cosine decay strategy.
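The warmup cosine decay strategy mentioned in the setup can be sketched as follows. The peak learning rate of 2 × 10⁻⁵ comes from the paper; the warmup length and the decay-to-zero floor are assumptions, since the paper does not state them.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Linear warmup followed by cosine decay to zero.

    base_lr is the paper's reported peak learning rate (2e-5);
    warmup_steps is illustrative, as the warmup length is not reported.
    """
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice such a schedule is usually attached to the optimizer via a per-step callback (e.g., a LambdaLR-style wrapper), with `total_steps` computed from the 10 training epochs and the effective batch size of 4 GPUs × 32 samples per device.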