Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering

Authors: Zhaohe Liao, Jiangtong Li, Siyu Sun, Qingyang Liu, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Guang Chen, Li Niu, Changjun Jiang, Liqing Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across 11 Video QA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To validate the effectiveness of our LTR framework, we select four existing MLLMs... and conduct experiments on 11 Video QA benchmarks. Additionally, we perform ablation studies to analyze the effectiveness of each component... Moreover, we provide case studies...
Researcher Affiliation | Collaboration | 1 MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; 2 Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shanghai, China; 3 School of Computer Science and Technology, Tongji University, Shanghai, China; 4 Bilibili Inc., Shanghai, China.
Pseudocode | No | The paper describes the 'Divide with Top-down Recursive Checking' and 'Conquer with Bottom-up Tree Reasoning' stages with detailed textual explanations and illustrations (Figures 1 and 2), but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement by the authors that they are releasing their code, nor does it provide a direct link to a source-code repository for the methodology described in this paper.
Open Datasets | Yes | We evaluate the LTR framework on 11 Video QA benchmarks, including MSVD-QA (Xu et al., 2016), MSRVTT-QA (Xu et al., 2016), TGIF-QA (Jang et al., 2017), ActivityNet-QA (Yu et al., 2019), AGQA-Decomp (Gandhi et al., 2022), NExT-QA (Xiao et al., 2021), Causal-VidQA (Li et al., 2022a), STAR (Wu et al., 2023), EgoSchema (Mangalam et al., 2023), Video-MME (Fu et al., 2024), and MVBench (Li et al., 2024b).
Dataset Splits | Yes | To illustrate improvements in compositional consistency, we utilize the DAG from the AGQA-Decomp test set for bottom-up tree reasoning. Regarding accuracy improvement, we observe more pronounced gains in sub-questions compared to main questions. This is attributed to the relative simplicity of sub-questions, which facilitates more effective reasoning. Furthermore, the improvement in cF1 is much larger than that in accuracy. This improvement is attributed to the Video-aided Logical Reasoning module, which exploits the logical relationships within the structure, enabling the QA information in perceptual questions to propagate along the tree and help the model answer more cognitive questions, thereby enhancing the compositional consistency between main and sub-questions. ... MVBench: A comprehensive multi-modal video understanding benchmark.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or memory used for conducting the experiments. It mentions input/output settings like video resolution and token length, but no hardware.
Software Dependencies | No | The paper mentions using 'OpenAI's text-embedding-3-large model' and various MLLMs like 'VideoLLaMA3', 'VideoChat2', 'Qwen2-VL', and 'LLaVA-OneVision', but does not provide specific version numbers for these or any other software dependencies such as programming languages, libraries, or operating systems.
Experiment Setup | Yes | In our LTR framework, we utilize four different MLLMs: VideoLLaMA3 (Zhang et al., 2025), VideoChat2 (Li et al., 2024b), Qwen2-VL (Wang et al., 2024), and LLaVA-OneVision (Li et al., 2024a). We set the video resolution to 336×336 pixels and uniformly sample 16 frames from each video. The maximum number of newly generated tokens is restricted to 2048. Other settings follow the recommended zero-shot generation settings of each baseline model.