HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking
Authors: Runquan Gui, Zhihai Wang, Jie Wang, Chi Ma, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Defu Lian, Enhong Chen, Feng Wu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6% performance improvement over o1-preview. (Section 5.1, Setups) Benchmarks: To evaluate the effectiveness of our method, we select three of the most challenging planning datasets: TravelPlanner (Xie et al., 2024), PlanBench (Valmeekam et al., 2024a), and NaturalPlan (Zheng et al., 2024). (Section 5.2, Main Results) As shown in Table 1, we evaluate HTP's effectiveness across these benchmarks. (Section 5.4, Ablation Study and Additional Analysis) To assess the impact of individual HTP modules on overall performance, we conduct an ablation study using GPT-4o and Gemini-1.5-Pro as the backbone models. |
| Researcher Affiliation | Collaboration | 1) MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 3) Noah's Ark Lab, Huawei Technologies; 4) College of Intelligence and Computing, Tianjin University; 5) State Key Laboratory of Cognitive Intelligence & University of Science and Technology of China. |
| Pseudocode | Yes | Algorithm 1: Top-down Hypertree Construction Algorithm. Input: rules R, query q, LLM π_θ, reasoning depth K, expansion width W. Convert divisible set: D ← Convert(R); initialize hypertree: H ← q; for d = 1 to K do ... |
| Open Source Code | No | The paper does not contain an explicit statement about open-sourcing the code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | To evaluate the effectiveness of our method, we select three of the most challenging planning datasets: TravelPlanner (Xie et al., 2024), PlanBench (Valmeekam et al., 2024a), and NaturalPlan (Zheng et al., 2024). |
| Dataset Splits | Yes | 1) TravelPlanner is a planning benchmark focused on travel planning, aiming to find an itinerary that satisfies diverse constraints regarding flights, accommodations, and other travel arrangements. In this study, we select the validation set for evaluation, which contains 180 queries and is divided into 9 groups based on difficulty levels (easy, medium, and hard) and trip durations (3, 5, and 7 days). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | Yes | For OpenAI models, we use gpt-3.5-turbo-1106 and gpt-4o-2024-08-06. For Gemini-1.5-Pro, we use Google Gemini-1.5-Pro APIs to obtain results. We set the temperature to 0 for all models. |
| Experiment Setup | Yes | We set the temperature to 0 for all models. To effectively select the optimal hyperchains from H and manage their number, inspired by tree-structured methods for limiting width, we adopt three strategies: a width-based pruning method, which restricts the total number of branches; a probability-based pruning method, where hyperchains with low confidence probabilities generated by the LLM during branching are eliminated; and an LLM-guided evaluation method, which leverages the LLM to filter and assess candidate hyperchains. |