Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering

Authors: Zhaohe Liao, Jiangtong Li, Siyu Sun, Qingyang Liu, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Guang Chen, Li Niu, Changjun Jiang, Liqing Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across 11 Video QA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To validate the effectiveness of our LTR framework, we select four existing MLLMs... and conduct experiments on 11 Video QA benchmarks. Additionally, we perform ablation studies to analyze the effectiveness of each component... Moreover, we provide case studies...
Researcher Affiliation | Collaboration | 1 MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; 2 Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shanghai, China; 3 School of Computer Science and Technology, Tongji University, Shanghai, China; 4 Bilibili Inc., Shanghai, China.
Pseudocode | No | The paper describes the 'Divide with Top-down Recursive Checking' and 'Conquer with Bottom-up Tree Reasoning' stages with detailed textual explanations and illustrations (Figures 1 and 2), but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement by the authors that they are releasing their code, nor does it provide a direct link to a source-code repository for the methodology described in this paper.
Open Datasets | Yes | We evaluate the LTR framework on 11 Video QA benchmarks, including MSVD-QA (Xu et al., 2016), MSRVTT-QA (Xu et al., 2016), TGIF-QA (Jang et al., 2017), ActivityNet-QA (Yu et al., 2019), AGQA-Decomp (Gandhi et al., 2022), NExT-QA (Xiao et al., 2021), Causal-VidQA (Li et al., 2022a), STAR (Wu et al., 2023), EgoSchema (Mangalam et al., 2023), Video-MME (Fu et al., 2024), and MVBench (Li et al., 2024b).
Dataset Splits | Yes | To illustrate improvements in compositional consistency, we utilize the DAG from the AGQA-Decomp test set for bottom-up tree reasoning. Regarding accuracy improvement, we observe more pronounced gains in sub-questions compared to main questions. This is attributed to the relative simplicity of sub-questions, which facilitates more effective reasoning. Furthermore, the improvement in cF1 is much larger than that in accuracy. This improvement is attributed to the Video-aided Logical Reasoning module, which exploits the logical relationships within the structure, enabling the QA information in perceptual questions to propagate along the tree and help the model answer more cognitive questions, thereby enhancing the compositional consistency between main and sub-questions. ... MVBench: A comprehensive multi-modal video understanding benchmark.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or memory used for conducting the experiments. It mentions input/output settings like video resolution and token length, but no hardware.
Software Dependencies | No | The paper mentions using 'OpenAI's text-embedding-3-large model' and various MLLMs like 'VideoLLaMA3', 'VideoChat2', 'Qwen2-VL', and 'LLaVA-OneVision', but does not provide specific version numbers for these or any other software dependencies such as programming languages, libraries, or operating systems.
Experiment Setup | Yes | In our LTR framework, we utilize four different MLLMs: VideoLLaMA3 (Zhang et al., 2025), VideoChat2 (Li et al., 2024b), Qwen2-VL (Wang et al., 2024), and LLaVA-OneVision (Li et al., 2024a). We set the video resolution to 336×336 pixels and uniformly sample 16 frames from each video. The maximum number of newly generated tokens is restricted to 2048. Other settings follow the recommended zero-shot generation settings of each baseline model.