Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
Authors: Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple VQA datasets, significantly outperforming both In-Context Learning (ICL) and Vanilla-RAG methods. |
| Researcher Affiliation | Collaboration | 1School of Artificial Intelligence, University of Chinese Academy of Sciences, China 2MAIS, Institute of Automation, Chinese Academy of Sciences, China 3Alibaba Cloud Computing, China. |
| Pseudocode | Yes | A. Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) We detail the complete workflow of MCTS below. Tree Initialization: A root node is initialized using the native user's query without any retrieved samples, generating a zero-shot response for the early stopping strategy. Node Expansion: The algorithm employs a value Q(a) and the visit count N(a) to rank all nodes that have not been fully expanded. |
| Open Source Code | Yes | https://github.com/yannqi/RCTS-RAG |
| Open Datasets | Yes | To validate the effectiveness of our proposed method, we conduct extensive experiments across multiple reasoning VQA datasets, including ScienceQA (Lu et al., 2022), MMMU (Yue et al., 2024), and MathV (Wang et al., 2024a). Our method also excels in non-reasoning VQA datasets such as VizWiz (Gurari et al., 2018) and VSR-MC (Liu et al., 2023). |
| Dataset Splits | Yes | Following the original splits of these VQA datasets, we construct the knowledge base with the training set and build the evaluation set with the testing set, respectively. Tab. 1 presents the size statistics of the knowledge base and the evaluation set. For ScienceQA, we utilize the training and validation sets, which consist of 16,967 examples, as our knowledge base. The test set, containing 4,241 examples, is employed for evaluation purposes. |
| Hardware Specification | Yes | For efficiency, LVLMs with over 7B parameters are implemented in 4-bit quantization by AWQ (Lin et al., 2024a) on a single 4090 24GB GPU. |
| Software Dependencies | No | The paper mentions using AWQ for 4-bit quantization and adapting models from PreFLMR, but does not provide specific version numbers for any software dependencies like libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For the setting of multiple rounds of LVLMs generation, we set Nc = Np = 10, Ns = Nm = 5. For the setting of our MCTS-HR, we adopt the same number of few-shot samples with K = 3, i.e., a maximum tree depth of 3. The number of initial retrieval examples is set to N = 20 as the action space of MCTS-HR. The maximum width of the tree is set to 3 for more action exploration. We set the default rollouts with P = 10, and the reward weight with default α = 0.2. |
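The pseudocode row notes that node expansion ranks candidates using a value Q(a) and a visit count N(a). The paper's exact selection rule is not quoted here, so the sketch below assumes a standard UCT-style score (exploitation plus exploration), which is the common way these two quantities are combined in MCTS; the function names and the exploration constant `c` are illustrative, not from the paper.

```python
import math

def uct_score(q, n, n_parent, c=1.41):
    """UCT-style score for a child node: mean value plus an exploration bonus.

    q        -- accumulated value Q(a) of the action/node
    n        -- visit count N(a) of the node
    n_parent -- visit count of the parent node
    c        -- exploration constant (assumed; not specified in the paper)
    """
    if n == 0:
        # Unvisited nodes get infinite priority so they are expanded first.
        return float("inf")
    return q / n + c * math.sqrt(math.log(n_parent) / n)

def select_child(children, n_parent):
    """Pick the index of the child with the highest UCT score.

    children -- list of (Q, N) pairs for each not-fully-expanded node
    """
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], n_parent),
    )
```

In the paper's setting, the action space for this selection would be the N = 20 initially retrieved examples, with tree depth capped at K = 3 (the number of few-shot samples) and P = 10 rollouts per query.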