Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

Authors: Hao Jiang, Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Yang Song, Kun Gai, Yadong Mu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments on 11 mainstream video question answering datasets and the experimental results demonstrate the effectiveness of our proposed method."
Researcher Affiliation | Collaboration | ¹Peking University, ²Kuaishou Technology; EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using natural language and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the code of this work to facilitate future work." Footnote link: https://code-website.wixsite.com/videoqa
Open Datasets | Yes | "We evaluate the model performance on 11 mainstream video question answering datasets, including 4 open-ended question answering benchmarks (Xu et al. 2017; Yu et al. 2019; Li et al. 2016), a text generation benchmark (Maaz et al. 2024), 5 multiple-choice benchmarks (Xiao et al. 2021; Wu et al. 2023; Lei et al. 2018; Li et al. 2023a; Mangalam, Akshulakov, and Malik 2023), and a recently proposed comprehensive video understanding benchmark (Li et al. 2024a)."
Dataset Splits | No | The paper states: "Following (Li et al. 2024b), a total of 790K pairs are used in pre-training and 763K samples are utilized for instruction tuning." While this gives training data counts, it does not specify explicit training, validation, and test splits (e.g., percentages or exact counts per split) for the 11 video question answering datasets used in evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware used for its experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions models like "Vicuna-7B (Chiang et al. 2023)", "EVA-G (Sun et al. 2023)", and "QFormer in InstructBLIP (Dai et al. 2023)" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | "We use Vicuna-7B (Chiang et al. 2023) as the LLM, and the visual encoder is EVA-G (Sun et al. 2023). Pre-trained QFormer in InstructBLIP (Dai et al. 2023) is employed for feature fusion between frames and questions. w_l and w_h are 224 and 448 respectively. Two-layer MLPs are used to project visual tokens into the LLM semantic space. During the pre-training stage, only the projection layer is trained, while in the instruction tuning phase, the LLM, Q-Former, and projection layer are trained. Following (Li et al. 2024b), a total of 790K pairs are used in pre-training and 763K samples are utilized for instruction tuning. ξ is set to 0.4."
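For readers reconstructing the setup, the "two-layer MLPs used to project visual tokens into LLM semantic space" can be sketched as follows. This is an illustrative sketch, not the authors' released code: the feature dimensions (1408 for EVA-G-style visual features, 4096 for a Vicuna-7B embedding space), the ReLU activation, and the token count are all assumptions, since the paper does not report them.

```python
import numpy as np


def two_layer_mlp_projection(visual_tokens, d_vis=1408, d_hidden=4096, d_llm=4096, seed=0):
    """Project visual tokens into an LLM embedding space with a two-layer MLP.

    Dimensions and the ReLU nonlinearity are illustrative assumptions
    (EVA-G-style features -> Vicuna-7B-sized embeddings); the paper only
    states that two-layer MLPs are used, not their exact configuration.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((d_vis, d_hidden)) * 0.02  # first projection layer
    b1 = np.zeros(d_hidden)
    w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02  # second projection layer
    b2 = np.zeros(d_llm)
    hidden = np.maximum(visual_tokens @ w1 + b1, 0.0)  # ReLU assumed here
    return hidden @ w2 + b2


# Hypothetical batch: 32 fused frame-question tokens from the Q-Former.
tokens = np.random.default_rng(1).standard_normal((32, 1408))
projected = two_layer_mlp_projection(tokens)
print(projected.shape)  # (32, 4096)
```

During pre-training only this projection would be updated (its weights are the sole trainable parameters), while instruction tuning additionally unfreezes the LLM and Q-Former, per the quoted setup.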