Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
Authors: Qirui Chen, Shangzhe Di, Weidi Xie
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results reveal that existing multimodal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction-tuning data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a baseline for this new task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness. |
| Researcher Affiliation | Academia | Qirui Chen1,2, Shangzhe Di1,2, Weidi Xie1* 1School of Artificial Intelligence, Shanghai Jiao Tong University, China 2Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China EMAIL |
| Pseudocode | No | The paper describes methods in prose and through mathematical formulations and diagrams (e.g., Figure 3), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://qirui-chen.github.io/MultiHop-EgoQA |
| Open Datasets | Yes | Recently, the introduction of the Ego4D dataset (Grauman et al. 2022) has enabled a series of research in visual-language understanding in egocentric videos... To monitor the progress of this new task, we further curate a high-quality benchmark, MULTIHOP-EGOQA, with careful manual verification and refinement... We also evaluate our architecture on another public single-hop VidQA benchmark, ActivityNet-RTL (Huang et al. 2024b), outperforming existing approaches by a large margin. |
| Dataset Splits | No | The paper states: "We randomly sample 10% of the test split and request participants to answer the questions and localise relevant time spans." and "We utilize the triplets generated in our automated pipeline to train the multi-modal LLM and the grounding module. These triplets have been filtered by the LLM, but not manually refined in Stage IV, consisting of 3,156 clips with a total of 10,414 samples." However, it does not explicitly provide the train/validation/test splits (e.g., percentages or exact counts) for the MULTIHOP-EGOQA benchmark or for ActivityNet-RTL in the main text. |
| Hardware Specification | Yes | The experiments are conducted using 4 NVIDIA H800 (80GB) GPUs, with a batch size of 32 per device. |
| Software Dependencies | Yes | The large language model employed is Vicuna-7B v1.3 (Chiang et al. 2023). |
| Experiment Setup | Yes | The large language model employed is Vicuna-7B v1.3 (Chiang et al. 2023). The dimensions of the hidden states for the LLM and the grounding module are 4096 and 1024, respectively. The experiments are conducted using 4 NVIDIA H800 (80GB) GPUs, with a batch size of 32 per device. The model is trained for 10 epochs with a learning rate of 2 × 10⁻⁵, employing a warmup cosine decay strategy. |
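The paper reports the schedule hyperparameters (peak learning rate 2 × 10⁻⁵, warmup cosine decay) but not the scheduler implementation. A minimal sketch of a standard linear-warmup cosine-decay schedule consistent with those values is below; the warmup step count and total step count are illustrative assumptions, not figures from the paper.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to zero.

    base_lr matches the paper's reported 2e-5; warmup_steps and
    total_steps are assumed values for illustration only.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At `step == warmup_steps` the schedule is exactly at `base_lr`, and it reaches zero at `total_steps`, which is the usual shape of a warmup cosine decay strategy.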