Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
Authors: Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple VQA datasets, significantly outperforming both In-Context Learning (ICL) and Vanilla-RAG methods. |
| Researcher Affiliation | Collaboration | 1School of Artificial Intelligence, University of Chinese Academy of Sciences, China 2MAIS, Institute of Automation, Chinese Academy of Sciences, China 3Alibaba Cloud Computing, China. |
| Pseudocode | Yes | A. Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) We detail the complete workflow of MCTS below. Tree Initialization: A root node is initialized using the native user's query without any retrieved samples, generating a zero-shot response for the early stopping strategy. Node Expansion: The algorithm employs a value Q(a) and the visit count N(a) to rank all nodes that have not been fully expanded. |
| Open Source Code | Yes | https://github.com/yannqi/RCTS-RAG |
| Open Datasets | Yes | To validate the effectiveness of our proposed method, we conduct extensive experiments across multiple reasoning VQA datasets, including ScienceQA (Lu et al., 2022), MMMU (Yue et al., 2024), and MathV (Wang et al., 2024a). Our method also excels in non-reasoning VQA datasets such as VizWiz (Gurari et al., 2018) and VSR-MC (Liu et al., 2023). |
| Dataset Splits | Yes | Following the original splits of these VQA datasets, we construct the knowledge base with the training set and build the evaluation set with the testing set, respectively. Tab. 1 presents the size statistics of the knowledge base and the evaluation set. For ScienceQA, we utilize the training and validation sets, which consist of 16,967 examples, as our knowledge base. The test set, containing 4,241 examples, is employed for evaluation purposes. |
| Hardware Specification | Yes | For efficiency, LVLMs with over 7B parameters are implemented in 4-bit quantization by AWQ (Lin et al., 2024a) on a single 4090 24GB GPU. |
| Software Dependencies | No | The paper mentions using AWQ for 4-bit quantization and adapting models from PreFLMR, but does not provide specific version numbers for any software dependencies like libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For the setting of multiple rounds of LVLMs generation, we set Nc = Np = 10, Ns = Nm = 5. For the setting of our MCTS-HR, we adopt the same number of few-shot samples with K = 3, i.e., a maximum tree depth of 3. The number of initial retrieval examples is set to N = 20 as the action space of MCTS-HR. The maximum width of the tree is set to 3 for more action exploration. We set the default rollouts with P = 10, and the reward weight with default α = 0.2. |
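The pseudocode row notes that node expansion ranks candidates using a value Q(a) and a visit count N(a). The paper's exact selection rule is not quoted here, so the sketch below assumes a standard UCT-style score (exploitation plus exploration), which is the common way these two quantities are combined in MCTS; the function names and the exploration constant `c` are illustrative, not from the paper.

```python
import math

def uct_score(q, n, n_parent, c=1.41):
    """UCT-style score for a child node: mean value plus an exploration bonus.

    q        -- accumulated value Q(a) of the action/node
    n        -- visit count N(a) of the node
    n_parent -- visit count of the parent node
    c        -- exploration constant (assumed; not specified in the paper)
    """
    if n == 0:
        # Unvisited nodes get infinite priority so they are expanded first.
        return float("inf")
    return q / n + c * math.sqrt(math.log(n_parent) / n)

def select_child(children, n_parent):
    """Pick the index of the child with the highest UCT score.

    children -- list of (Q, N) pairs for each not-fully-expanded node
    """
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], n_parent),
    )
```

In the paper's setting, the action space for this selection would be the N = 20 initially retrieved examples, with tree depth capped at K = 3 (the number of few-shot samples) and P = 10 rollouts per query.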