Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering
Authors: Hao Jiang, Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Yang Song, Kun Gai, Yadong Mu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 11 mainstream video question answering datasets and the experimental results demonstrate the effectiveness of our proposed method. |
| Researcher Affiliation | Collaboration | 1Peking University 2Kuaishou Technology EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using natural language and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "We release the code of this work to facilitate future work" (footnote: https://code-website.wixsite.com/videoqa). |
| Open Datasets | Yes | We evaluate the model performance on 11 mainstream video question answering datasets, including 4 open-ended question answering benchmarks (Xu et al. 2017; Yu et al. 2019; Li et al. 2016), a text generation benchmark (Maaz et al. 2024), 5 multiple-choice benchmarks (Xiao et al. 2021; Wu et al. 2023; Lei et al. 2018; Li et al. 2023a; Mangalam, Akshulakov, and Malik 2023), and a recently proposed comprehensive video understanding benchmark (Li et al. 2024a). |
| Dataset Splits | No | The paper states: "Following (Li et al. 2024b), a total of 790K pairs are used in pre-training and 763K samples are utilized for instruction tuning." While this provides training data counts, it does not specify the explicit training, validation, and test splits (e.g., percentages or exact counts for each split) for the 11 mainstream video question answering datasets used for evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions models like "Vicuna-7B (Chiang et al. 2023)", "EVA-G (Sun et al. 2023)", and "QFormer in Instruct BLIP (Dai et al. 2023)" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use Vicuna-7B (Chiang et al. 2023) as the LLM, and the visual encoder is EVA-G (Sun et al. 2023). Pre-trained QFormer in Instruct BLIP (Dai et al. 2023) is employed for feature fusion between frames and questions. w_l and w_h are 224 and 448, respectively. Two-layer MLPs are used to project visual tokens into LLM semantic space. During the pre-training stage, only the projection layer is trained, while in the instruction tuning phase, LLM, Q-Former, and projection layer are trained. Following (Li et al. 2024b), a total of 790K pairs are used in pre-training and 763K samples are utilized for instruction tuning. ξ is set to 0.4. |
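The setup row describes a two-stage training schedule: only the projection layer is updated during pre-training, while the LLM, Q-Former, and projection layer are all updated during instruction tuning. A minimal sketch of that schedule follows; the module names and the `trainable_modules` helper are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the two-stage trainability schedule reported in the
# paper's experiment setup. Module names are placeholders, not the real API.
MODULES = ["visual_encoder", "qformer", "projection", "llm"]

SCHEDULE = {
    # Pre-training (790K pairs): only the projection layer is trained.
    "pretraining": {"projection"},
    # Instruction tuning (763K samples): LLM, Q-Former, and projection
    # layer are all trained.
    "instruction_tuning": {"llm", "qformer", "projection"},
}


def trainable_modules(stage: str) -> dict:
    """Map each module to whether its parameters are updated in `stage`."""
    trained = SCHEDULE[stage]
    return {name: (name in trained) for name in MODULES}
```

In a real implementation this mapping would drive `requires_grad` flags on the corresponding parameter groups before each stage's optimizer is built.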