Understanding Complexity in VideoQA via Visual Program Generation

Authors: Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rares Andrei Ambrus, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate several zero-shot Video QA methods on the resulting benchmark and observe a 1.9 gap in performance compared to existing datasets like NExT-QA (Xiao et al., 2021). ... Table 1. Comparison of question complexity metrics using mPEG on the validation set of NExT-QA. ... Further experiments that isolate and evaluate specific components of our pipeline can be found in the appendix. In particular, in Section 10.1 we validate our question selection algorithm. Similarly, in Section 10.2 we isolate the impact of video source on dataset difficulty, confirming the effectiveness of our question generation approach.
Researcher Affiliation Collaboration 1Stanford Computer Science 2Toyota Research Institute. Correspondence to: Cristobal Eyzaguirre <EMAIL>.
Pseudocode Yes
middle_frame = video.get_frame(video.num_frames // 2)
middle_caption = middle_frame.caption()
location = middle_frame.classify_location()
answer = answer_question(question, middle_caption, location)

for frame in video:
    if frame.simple_qa("is the dog letting go of the bone?"):
        let_go_started = True
    elif let_go_started:
        frame_after_started = frame
        break
description = frame_after_started.caption()
...
Figure 2. Estimating question complexity via code. Our approach to estimating question complexity involves converting the question into code, decomposing the pseudo-code into abstract syntax subtrees (Si), and correlating subtree presence with model performance.
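The subtree-decomposition step described in Figure 2 can be sketched with Python's standard `ast` module: every node of the parsed program is the root of one candidate subtree Si. This is a minimal illustration only; the paper's actual canonicalization and feature encoding may differ, and the snippet below is a hypothetical generated line with illustrative identifiers.

```python
import ast

def subtrees(code: str):
    """Enumerate abstract syntax subtrees of a code snippet.

    Each node returned by ast.walk is the root of one subtree S_i;
    ast.dump gives a string form usable as a presence feature.
    """
    tree = ast.parse(code)
    return [ast.dump(node) for node in ast.walk(tree)]

# Hypothetical line of generated question code (names are illustrative).
snippet = "answer = answer_question(question, middle_caption, location)"
features = subtrees(snippet)
```

Each entry in `features` is one candidate subtree whose presence across questions can then be correlated with model performance.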
Open Source Code Yes Finally, we release code, models, and other materials at ceyzaguirre4.github.io/codeplexity.
Open Datasets Yes We focus on the NExT-QA (Xiao et al., 2021) benchmark for its size, variety of human-annotated questions, and its focus on spatio-temporal reasoning in videos over mere visual-fact retrieval (Zhong et al., 2022). ... We generate questions using 3 different datasets, all of which provide scene-graph annotations: MOMA (Luo et al., 2021; 2022); ActivityNet (Caba Heilbron et al., 2015), which we combine with ActivityNet Entities (Zhou et al., 2019) and ActivityNet Captions (Krishna et al., 2017); and the Action Genome (Ji et al., 2020) annotations for Charades (Sigurdsson et al., 2016).
Dataset Splits Yes We perform the evaluation on the validation set, further splitting the questions into 80% used to train the metrics and the other 20% held out for computing mPEG.
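The 80/20 protocol above can be sketched in a few lines of standard-library Python; the shuffling procedure and seed here are assumptions, not the paper's exact split.

```python
import random

def split_questions(questions, train_frac=0.8, seed=0):
    """Shuffle validation questions, then hold out (1 - train_frac)
    for computing the metric; train on the rest.

    A minimal sketch of the 80/20 split described above; the seed and
    shuffling are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train_qs, held_out_qs = split_questions(range(100))
```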
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies Yes We leverage a state-of-the-art image captioning model, LLaVA 1.5 (Liu et al., 2023a;b), to list visual attributes of the main actors and objects in the videos.
Experiment Setup Yes We set the sampling temperature to zero and decode greedily (for replicability). ... We set the δ in Equation 8 to select the top 10% of the data according to the estimated complexity (calibrated on NExT-QA). ... The resulting model uses L2 regularization with weight c = 1.0 and is trained with the L-BFGS solver (Byrd et al., 1995).
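The regressor in the setup above is logistic regression over subtree-presence features with L2 weight c = 1.0. The sketch below is stdlib-only and therefore substitutes plain gradient descent for the paper's L-BFGS solver (Byrd et al., 1995); the data is synthetic and the feature layout (bias plus presence bits for hypothetical subtrees) is an illustrative assumption.

```python
import math
import random

def train_logreg(X, y, c=1.0, lr=0.5, steps=1000):
    """L2-regularized logistic regression trained by gradient descent.

    Note: gradient descent stands in for the L-BFGS solver used in the
    paper, purely to keep this sketch dependency-free.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [c * wj / n for wj in w]          # gradient of L2 penalty
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))       # predicted P(correct)
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

rng = random.Random(0)
# Each row: bias term + presence/absence of 8 hypothetical subtrees S_i.
X = [[1.0] + [float(rng.randint(0, 1)) for _ in range(8)] for _ in range(200)]
y = [xi[1] for xi in X]  # toy rule: one subtree alone predicts correctness

w = train_logreg(X, y)
preds = [1.0 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0.0 for xi in X]
acc = sum(p == t for p, t in zip(preds, y)) / len(y)
```

With scikit-learn available, the equivalent configuration would be `LogisticRegression(C=1.0, penalty="l2", solver="lbfgs")`.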