Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering
Authors: Zhaohe Liao, Jiangtong Li, Siyu Sun, Qingyang Liu, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Guang Chen, Li Niu, Changjun Jiang, Liqing Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across 11 Video QA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To validate the effectiveness of our LTR framework, we select four existing MLLMs... and conduct experiments on 11 Video QA benchmarks. Additionally, we perform ablation studies to analyze the effectiveness of each component... Moreover, we provide case studies... |
| Researcher Affiliation | Collaboration | 1 MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; 2 Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shanghai, China; 3 School of Computer Science and Technology, Tongji University, Shanghai, China; 4 Bilibili Inc., Shanghai, China. |
| Pseudocode | No | The paper describes the 'Divide with Top-down Recursive Checking' and 'Conquer with Bottom-up Tree Reasoning' stages with detailed textual explanations and illustrations (Figures 1 and 2), but it does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain an explicit statement by the authors that they are releasing their code, nor does it provide a direct link to a source-code repository for the methodology described in this paper. |
| Open Datasets | Yes | We evaluate the LTR framework on 11 Video QA benchmarks, including MSVD-QA (Xu et al., 2016), MSRVTT-QA (Xu et al., 2016), TGIF-QA (Jang et al., 2017), ActivityNet-QA (Yu et al., 2019), AGQA-Decomp (Gandhi et al., 2022), NExT-QA (Xiao et al., 2021), Causal-VidQA (Li et al., 2022a), STAR (Wu et al., 2023), EgoSchema (Mangalam et al., 2023), Video-MME (Fu et al., 2024), and MVBench (Li et al., 2024b). |
| Dataset Splits | Yes | To illustrate improvements in compositional consistency, we utilize the DAG from the AGQA-Decomp test set for bottom-up tree reasoning. Regarding accuracy improvement, we observe more pronounced gains in sub-questions compared to main questions. This is attributed to the relative simplicity of sub-questions, which facilitates more effective reasoning. Furthermore, the improvement in cF1 is much larger than that in accuracy. This improvement is attributed to the Video-aided Logical Reasoning module, which exploits the logical relationships within the structure, enabling the QA information in perceptual questions to propagate along the tree and help the model answer more cognitive questions, thereby enhancing the compositional consistency between main and sub-questions. ... MVBench: A comprehensive multi-modal video understanding benchmark. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or memory used for conducting the experiments. It mentions input/output settings like video resolution and token length, but no hardware. |
| Software Dependencies | No | The paper mentions using 'OpenAI's text-embedding-3-large model' and various MLLMs like 'VideoLLaMA3', 'VideoChat2', 'Qwen2-VL', and 'LLaVA-OneVision', but does not provide specific version numbers for these or any other software dependencies such as programming languages, libraries, or operating systems. |
| Experiment Setup | Yes | In our LTR framework, we utilize four different MLLMs: VideoLLaMA3 (Zhang et al., 2025), VideoChat2 (Li et al., 2024b), Qwen2-VL (Wang et al., 2024), and LLaVA-OneVision (Li et al., 2024a). We set the video resolution to 336×336 pixels and uniformly sample 16 frames from each video. The maximum newly generated length is restricted to 2048 tokens. Other settings follow the recommended zero-shot generation settings of each baseline model. |
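The Experiment Setup row specifies uniform sampling of 16 frames per video. Since the paper releases no code, the following is a minimal sketch of what such uniform index selection might look like; the function name and center-of-bin strategy are assumptions, not taken from the paper.

```python
def uniform_frame_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    """Pick num_samples frame indices spread evenly across [0, num_frames).

    Hypothetical helper illustrating the '16 uniformly sampled frames'
    setting described in the paper; one common choice is to take the
    center of each of num_samples equal-length bins.
    """
    if num_frames <= num_samples:
        # Short video: keep every frame that exists.
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

Each sampled frame would then be resized to the 336×336 resolution the paper reports before being passed to the MLLM.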