Understanding Complexity in VideoQA via Visual Program Generation

Authors: Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rares Andrei Ambrus, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate several zero-shot Video QA methods on the resulting benchmark and observe a 1.9 gap in performance compared to existing datasets like NExT-QA (Xiao et al., 2021). ... Table 1. Comparison of question complexity metrics using mPEG on the validation set of NExT-QA. ... Further experiments that isolate and evaluate specific components of our pipeline can be found in the appendix. In particular, in Section 10.1 we validate our question selection algorithm. Similarly, in Section 10.2 we isolate the impact of video source on dataset difficulty, confirming the effectiveness of our question generation approach.
Researcher Affiliation Collaboration 1Stanford Computer Science 2Toyota Research Institute. Correspondence to: Cristobal Eyzaguirre <EMAIL>.
Pseudocode Yes
middle_frame = video.get_frame(video.num_frames // 2)
middle_caption = middle_frame.caption()
location = middle_frame.classify_location()
answer = answer_question(question, middle_caption, location)

for frame in video:
    if frame.simple_qa("is the dog letting go of the bone?"):
        let_go_started = True
    elif let_go_started:
        frame_after_started = frame
        break
description = frame_after_started.caption()
...
Figure 2. Estimating question complexity via code. Our approach to estimating question complexity involves converting the question into code, decomposing the pseudo-code into abstract syntax subtrees (Si), and correlating subtree presence with model performance.
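The subtree-decomposition step described in Figure 2 can be sketched with Python's standard `ast` module: every node of the parsed program is the root of one candidate subtree Si. This is a minimal illustration only; the paper's actual canonicalization and feature encoding may differ, and the snippet below is a hypothetical generated line with illustrative identifiers.

```python
import ast

def subtrees(code: str):
    """Enumerate abstract syntax subtrees of a code snippet.

    Each node returned by ast.walk is the root of one subtree S_i;
    ast.dump gives a string form usable as a presence feature.
    """
    tree = ast.parse(code)
    return [ast.dump(node) for node in ast.walk(tree)]

# Hypothetical line of generated question code (names are illustrative).
snippet = "answer = answer_question(question, middle_caption, location)"
features = subtrees(snippet)
```

Each entry in `features` is one candidate subtree whose presence across questions can then be correlated with model performance.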
Open Source Code Yes Finally, we release code, models, and other materials at ceyzaguirre4.github.io/codeplexity.
Open Datasets Yes We focus on the NExT-QA (Xiao et al., 2021) benchmark for its size, variety of human-annotated questions, and its focus on spatio-temporal reasoning in videos over mere visual-fact retrieval (Zhong et al., 2022). ... We generate questions using 3 different datasets, all of which provide scene-graph annotations: MOMA (Luo et al., 2021; 2022); ActivityNet (Caba Heilbron et al., 2015), which we combine with ActivityNet Entities (Zhou et al., 2019) and ActivityNet Captions (Krishna et al., 2017); and the Action Genome (Ji et al., 2020) annotations for Charades (Sigurdsson et al., 2016).
Dataset Splits Yes We perform the evaluation on the validation set, further splitting the questions into 80% used to train the metrics and the other 20% held out for computing mPEG.
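The 80/20 protocol above can be sketched in a few lines of standard-library Python; the shuffling procedure and seed here are assumptions, not the paper's exact split.

```python
import random

def split_questions(questions, train_frac=0.8, seed=0):
    """Shuffle validation questions, then hold out (1 - train_frac)
    for computing the metric; train on the rest.

    A minimal sketch of the 80/20 split described above; the seed and
    shuffling are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train_qs, held_out_qs = split_questions(range(100))
```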
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies Yes We leverage a state-of-the-art image captioning model, LLaVA 1.5 (Liu et al., 2023a;b), to list visual attributes of the main actors and objects in the videos.
Experiment Setup Yes We set the sampling temperature to zero and decode greedily (for replicability). ... We set the δ in Equation 8 to select the top 10% of the data according to the estimated complexity (calibrated on NExT-QA). ... The resulting model uses L2 regularization with weight c = 1.0 and is trained with the L-BFGS solver (Byrd et al., 1995).
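The regressor in the setup above is logistic regression over subtree-presence features with L2 weight c = 1.0. The sketch below is stdlib-only and therefore substitutes plain gradient descent for the paper's L-BFGS solver (Byrd et al., 1995); the data is synthetic and the feature layout (bias plus presence bits for hypothetical subtrees) is an illustrative assumption.

```python
import math
import random

def train_logreg(X, y, c=1.0, lr=0.5, steps=1000):
    """L2-regularized logistic regression trained by gradient descent.

    Note: gradient descent stands in for the L-BFGS solver used in the
    paper, purely to keep this sketch dependency-free.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [c * wj / n for wj in w]          # gradient of L2 penalty
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))       # predicted P(correct)
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

rng = random.Random(0)
# Each row: bias term + presence/absence of 8 hypothetical subtrees S_i.
X = [[1.0] + [float(rng.randint(0, 1)) for _ in range(8)] for _ in range(200)]
y = [xi[1] for xi in X]  # toy rule: one subtree alone predicts correctness

w = train_logreg(X, y)
preds = [1.0 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0.0 for xi in X]
acc = sum(p == t for p, t in zip(preds, y)) / len(y)
```

With scikit-learn available, the equivalent configuration would be `LogisticRegression(C=1.0, penalty="l2", solver="lbfgs")`.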