Graph Prompts: Adapting Video Graph for Video Question Answering

Authors: Yiming Li, Xiaoshan Yang, Bing-Kun Bao, Changsheng Xu

IJCAI 2025

Reproducibility assessment (each entry gives the variable, the result, and the LLM response):
Research Type: Experimental. "Extensive experiments on various datasets have demonstrated the promising performance of GP-VQA." From Section 4 (Experiments): "Dataset: We evaluate our model on three recently proposed challenging datasets for the long-form Video QA, namely AGQA v2 [Grunde-McLaughlin et al., 2022], NExT-QA [Xiao et al., 2021], STAR [Wu et al., 2021]." Relevant subsections: 4.1 Comparison with State-of-the-arts; 4.2 Ablation Study.
Researcher Affiliation: Academia. Authors and affiliations: Yiming Li (1,5), Xiaoshan Yang (2,3,4), Bing-Kun Bao (1,4), Changsheng Xu (2,3,4); (1) Nanjing University of Posts and Telecommunications, (2) Institute of Automation, Chinese Academy of Sciences, (3) School of Artificial Intelligence, University of Chinese Academy of Sciences, (4) Pengcheng Laboratory, (5) State Key Laboratory of Tibetan Intelligence. All listed affiliations are academic institutions or public research labs, and the email domains (.edu.cn, .ia.ac.cn) are academic/research oriented; there are no corporate affiliations.
Pseudocode: No. The paper describes the methodology using natural language and mathematical equations (Eqs. 1-8) and illustrates the framework in Figure 2, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing the source code for the GP-VQA method, nor a link to a code repository. It mentions using existing models such as LLaVA-7B and Qianwen2-VL-7B, but not its own implementation code.
Open Datasets: Yes. "Dataset: We evaluate our model on three recently proposed challenging datasets for the long-form Video QA, namely AGQA v2 [Grunde-McLaughlin et al., 2022], NExT-QA [Xiao et al., 2021], STAR [Wu et al., 2021]." Moreover, the adopted video scene graph generation models are pre-trained on Action Genome [Ji et al., 2020].
Dataset Splits: No. The paper states, "We sample 1 frame in every 3 frames for pre-training," but it does not provide train/validation/test split percentages, sample counts, or references to standard splits for the AGQA v2, NExT-QA, and STAR datasets. While these are benchmark datasets, the paper does not specify how the data was partitioned for its experiments.
Hardware Specification: No. The paper mentions using "Mask R-CNN [He et al., 2017] with a ResNet-101 backbone" as the object detector and "the frozen LLaVA-7B and Qianwen2-VL-7B with LoRA" as vision-language models, but it does not specify the hardware used to run the experiments, such as GPU models, CPU types, or memory configurations.
Software Dependencies: No. The paper mentions several software components, such as Mask R-CNN with a ResNet-101 backbone, LLaVA-7B, Qianwen2-VL-7B, LoRA, GloVe [Pennington et al., 2014], Stanford CoreNLP [Manning et al., 2014], the Transformer [Vaswani et al., 2017], and the SGD optimizer. However, it does not provide version numbers for these dependencies or libraries, which are crucial for reproducibility.
Experiment Setup: Yes. "During the pre-training stage, we use the SGD optimizer with an initial learning rate of 0.001 and decay the learning rate by multiplying it with 0.9 after every epoch. The momentum is set to 0.9 and the size of the mini-batch is set to 8. For hyper-parameters, we set the random mask rate λ to 0.08, while the α, β, and γ in prompts generation are set to 0.7, 0.8, and 2 respectively. Moreover, we sample 1 frame in every 3 frames for pre-training. For prompt-tuning, we use the same settings as pre-training, except the initial learning rate is 1e-5."
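The reported schedule and hyper-parameters can be sketched in plain Python; this is only an illustration of the numbers quoted above (initial lr 0.001, multiplied by 0.9 per epoch, momentum 0.9, batch size 8, λ = 0.08, α/β/γ = 0.7/0.8/2, prompt-tuning lr 1e-5), not the authors' code, and the function name is ours.

```python
# Hedged sketch of the pre-training configuration reported in the paper.
# Not the authors' implementation; names below are illustrative only.

def lr_at_epoch(initial_lr: float, decay: float, epoch: int) -> float:
    """Learning rate in effect after `epoch` full epochs of multiplicative decay."""
    return initial_lr * decay ** epoch

# Values quoted from the Experiment Setup section.
INIT_LR = 1e-3          # initial learning rate (pre-training)
DECAY = 0.9             # multiply lr by 0.9 after every epoch
MOMENTUM = 0.9          # SGD momentum
BATCH_SIZE = 8          # mini-batch size
MASK_RATE = 0.08        # random mask rate lambda
ALPHA, BETA, GAMMA = 0.7, 0.8, 2   # prompt-generation hyper-parameters
PROMPT_TUNE_LR = 1e-5   # prompt-tuning reuses the setup with this initial lr

# Learning rate over the first few epochs of pre-training.
for epoch in range(4):
    print(epoch, lr_at_epoch(INIT_LR, DECAY, epoch))
```

After three epochs the pre-training learning rate is 0.001 × 0.9³ = 0.000729, so the decay is gentle; the prompt-tuning stage restarts the same schedule from 1e-5.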