Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

Authors: Hao Jiang, Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Yang Song, Kun Gai, Yadong Mu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments on 11 mainstream video question answering datasets and the experimental results demonstrate the effectiveness of our proposed method."
Researcher Affiliation | Collaboration | ¹Peking University, ²Kuaishou Technology; EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using natural language and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the code of this work to facilitate future work." Footnote link: https://code-website.wixsite.com/videoqa
Open Datasets | Yes | "We evaluate the model performance on 11 mainstream video question answering datasets, including 4 open-ended question answering benchmarks (Xu et al. 2017; Yu et al. 2019; Li et al. 2016), a text generation benchmark (Maaz et al. 2024), 5 multiple-choice benchmarks (Xiao et al. 2021; Wu et al. 2023; Lei et al. 2018; Li et al. 2023a; Mangalam, Akshulakov, and Malik 2023), and a recently proposed comprehensive video understanding benchmark (Li et al. 2024a)."
Dataset Splits | No | The paper states: "Following (Li et al. 2024b), a total of 790K pairs are used in pre-training and 763K samples are utilized for instruction tuning." While this gives training data counts, it does not specify explicit training, validation, and test splits (e.g., percentages or exact counts per split) for the 11 video question answering datasets used in evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware used for its experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions models like "Vicuna-7B (Chiang et al. 2023)", "EVA-G (Sun et al. 2023)", and "QFormer in InstructBLIP (Dai et al. 2023)" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | "We use Vicuna-7B (Chiang et al. 2023) as the LLM, and the visual encoder is EVA-G (Sun et al. 2023). Pre-trained QFormer in InstructBLIP (Dai et al. 2023) is employed for feature fusion between frames and questions. w_l and w_h are 224 and 448 respectively. Two-layer MLPs are used to project visual tokens into the LLM semantic space. During the pre-training stage, only the projection layer is trained, while in the instruction tuning phase, the LLM, Q-Former, and projection layer are trained. Following (Li et al. 2024b), a total of 790K pairs are used in pre-training and 763K samples are utilized for instruction tuning. ξ is set to 0.4."
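For readers reconstructing the setup, the "two-layer MLPs used to project visual tokens into LLM semantic space" can be sketched as follows. This is an illustrative sketch, not the authors' released code: the feature dimensions (1408 for EVA-G-style visual features, 4096 for a Vicuna-7B embedding space), the ReLU activation, and the token count are all assumptions, since the paper does not report them.

```python
import numpy as np


def two_layer_mlp_projection(visual_tokens, d_vis=1408, d_hidden=4096, d_llm=4096, seed=0):
    """Project visual tokens into an LLM embedding space with a two-layer MLP.

    Dimensions and the ReLU nonlinearity are illustrative assumptions
    (EVA-G-style features -> Vicuna-7B-sized embeddings); the paper only
    states that two-layer MLPs are used, not their exact configuration.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((d_vis, d_hidden)) * 0.02  # first projection layer
    b1 = np.zeros(d_hidden)
    w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02  # second projection layer
    b2 = np.zeros(d_llm)
    hidden = np.maximum(visual_tokens @ w1 + b1, 0.0)  # ReLU assumed here
    return hidden @ w2 + b2


# Hypothetical batch: 32 fused frame-question tokens from the Q-Former.
tokens = np.random.default_rng(1).standard_normal((32, 1408))
projected = two_layer_mlp_projection(tokens)
print(projected.shape)  # (32, 4096)
```

During pre-training only this projection would be updated (its weights are the sole trainable parameters), while instruction tuning additionally unfreezes the LLM and Q-Former, per the quoted setup.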