Understanding Complexity in VideoQA via Visual Program Generation
Authors: Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rares Andrei Ambrus, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate several zero-shot Video QA methods on the resulting benchmark and observe a 1.9 gap in performance compared to existing datasets like NExT-QA (Xiao et al., 2021). ... Table 1. Comparison of question complexity metrics using mPEG on the validation set of NExT-QA. ... Further experiments that isolate and evaluate specific components of our pipeline can be found in the appendix. In particular, in Section 10.1 we validate our question selection algorithm. Similarly, in Section 10.2 we isolate the impact of video source on dataset difficulty, confirming the effectiveness of our question generation approach. |
| Researcher Affiliation | Collaboration | 1 Stanford Computer Science, 2 Toyota Research Institute. Correspondence to: Cristóbal Eyzaguirre <EMAIL>. |
| Pseudocode | Yes | `middle_frame = video.get_frame(video.num_frames // 2)` `middle_caption = middle_frame.caption()` `location = middle_frame.classify_location()` `answer = answer_question(question, middle_caption, location)` `for frame in video: if frame.simple_qa("is the dog letting go of the bone?"): let_go_started = True elif let_go_started: frame_after_started = frame break` `description = frame_after_started.caption()` ... Figure 2. Estimating question complexity via code. Our approach to estimating question complexity involves converting the question into code, decomposing the pseudo-code into abstract syntax subtrees (S_i), before correlating subtree presence with model performance. |
| Open Source Code | Yes | Finally, we release code, models, and other materials at ceyzaguirre4.github.io/codeplexity. |
| Open Datasets | Yes | We focus on the NExT-QA (Xiao et al., 2021) benchmark for its size, variety of human-annotated questions, and its focus on spatio-temporal reasoning in videos over mere visual-fact retrieval (Zhong et al., 2022). ... We generate questions using 3 different datasets, all of which provide scene-graph annotations: MOMA (Luo et al., 2021; 2022); ActivityNet (Caba Heilbron et al., 2015), which we combine with ActivityNet Entities (Zhou et al., 2019) and ActivityNet-Captions (Krishna et al., 2017); and the Action Genome (Ji et al., 2020) annotations for Charades (Sigurdsson et al., 2016). |
| Dataset Splits | Yes | We perform the evaluation on the validation set, further splitting the questions into 80% used to train the metrics and the other 20% held out for computing mPEG. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | Yes | We leverage a state-of-the-art image captioning model, LLaVA 1.5 (Liu et al., 2023a;b), to list visual attributes of the main actors and objects in the videos. |
| Experiment Setup | Yes | We set the sampling temperature to zero and decode greedily (for replicability). ... We set the δ in Equation 8 to select the top 10% of the data according to the estimated complexity (calibrated on NEx T-QA). ... The resulting model uses L2 regularization with weight c = 1.0 and is trained with the L-BFGS solver (Byrd et al., 1995). |
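The Figure 2 caption quoted above describes decomposing generated code into abstract syntax subtrees (S_i) and correlating subtree presence with model performance. A minimal, stdlib-only sketch of one way such subtree features could be extracted — the depth-1 "shape" encoding and the function name are illustrative assumptions, not the authors' implementation:

```python
import ast
from collections import Counter

def subtree_features(code: str) -> Counter:
    """Count shallow abstract-syntax subtrees in a code string.

    Each feature is a node type paired with the types of its direct
    children -- a simple stand-in for the subtrees S_i in Figure 2.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(code)):
        shape = (type(node).__name__,
                 tuple(type(c).__name__ for c in ast.iter_child_nodes(node)))
        counts[shape] += 1
    return counts

# Fragment of the paper's example pseudo-program (Figure 2).
program = """
for frame in video:
    if frame.simple_qa("is the dog letting go of the bone?"):
        let_go_started = True
"""
features = subtree_features(program)
```

Binary presence vectors over such features could then be fed to any classifier of model performance; the paper's exact subtree definition may differ.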
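The experiment-setup quote mentions a threshold δ chosen so that the top 10% of questions by estimated complexity are selected. A hypothetical sketch of that selection step (the scores and the helper name are made up for illustration; the paper calibrates δ on NExT-QA):

```python
# Hypothetical complexity scores for a pool of generated questions.
scores = [0.2, 1.5, 0.7, 3.1, 0.4, 2.2, 0.9, 1.1, 0.3, 2.8]

def top_fraction(scores, fraction=0.10):
    """Return indices of the top `fraction` of items by score,
    mirroring a delta threshold that keeps the hardest 10%."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

selected = top_fraction(scores)  # with 10 items, keeps only the highest score
```

Equivalently, δ is the 90th-percentile complexity value; any question scoring above it is retained.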