Language Models as Implicit Tree Search

Authors: Ziliang Chen, Zhao-Rong Lai, Yufeng Yang, Liangda Fang, Zhanfu Yang, Liang Lin

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment, and MCTS-based LMs in mathematical reasoning and planning tasks." "Our experiments included the evaluation across human preference alignment, mathematical reasoning, and mathematical planning, where our approach concurrently reaped the optima against DPO variants and MCTS-derived baselines." "In this section, we demonstrate the superiority of IT-PO from the step-synchronous (Theorem 4.2) and step-asynchronous (Theorem 4.3) perspectives." |
| Researcher Affiliation | Academia | "1 Research Institute of Multiple Agents and Embodied Intelligence, Peng Cheng Laboratory; 2 Jinan University, Guangzhou, China; 3 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 4 Department of Computer Science, Rutgers University. Correspondence to: Liang Lin <EMAIL>." |
| Pseudocode | Yes | "Algorithm 1: The algorithm pipeline of IT-PO" |
| Open Source Code | No | The paper neither states that its code is open-sourced nor links to a code repository. |
| Open Datasets | Yes | "Anthropic HH dataset (Bai et al., 2022) consists of 170k dialogues..." (https://huggingface.co/datasets/Anthropic/hh-rlhf); "mathematical reasoning on GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and mathematical planning on Game24 (Cobbe et al., 2021)"; "Proof Writer (Tafjord et al., 2020) for deductive logical reasoning, and Chess Endgame (Abdulhai et al., 2023) for long-term multi-turn decision making." |
| Dataset Splits | Yes | "The experiment primarily aims for three evaluation metrics: 1) Accuracy: we adopt the evaluation split in (Zeng et al., 2024a) to train all models, then evaluate their performance in terms of accuracy on the generated responses relative to chosen completions in the test dataset." Table 5 (task setups) reports GSM8k (mathematical reasoning): 7.5k / 1.3k; Game24 (mathematical planning): 1.0k / 0.3k. "For Proof Writer, we follow (Pan et al.) to generate the test set, then the rest are merged to 41,433 training instances." |
| Hardware Specification | No | The paper names the base models used (e.g., Pythia 2.8, LLAMA-7b, Qwen1.5-32B) but gives no details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper discusses various algorithms and language models (e.g., DPO, cDPO, Pythia 2.8, LLAMA-7b) but lists no software dependencies with version numbers (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | "We set ϵ = 0 for the human alignment task, and ϵ = 0.25 for the mathematical reasoning and planning tasks." "We set K = 8, inspired from the number of sampled responses for each prompt in many RLHF implementations." "All fine-tuning methods only run for a single epoch." "Table 5. Task setups. The node, tree max width, and tree max depth are search space parameters." |
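The Experiment Setup row pins down a few concrete hyperparameters (ϵ per task family, K = 8 sampled responses, a single fine-tuning epoch). Collecting them in one place can be sketched as a minimal config; the container and key names (`EXPERIMENT_SETUP`, `epsilon`, `K`, `epochs`) are hypothetical illustrations, not identifiers from the paper:

```python
# Illustrative config summarizing the reported experiment setup.
# All names here are hypothetical; the paper specifies only the values:
# epsilon = 0 for human alignment, 0.25 for math reasoning/planning,
# K = 8 sampled responses per prompt, and one fine-tuning epoch throughout.
EXPERIMENT_SETUP = {
    "human_alignment":        {"epsilon": 0.0,  "K": 8, "epochs": 1},
    "mathematical_reasoning": {"epsilon": 0.25, "K": 8, "epochs": 1},
    "mathematical_planning":  {"epsilon": 0.25, "K": 8, "epochs": 1},
}

for task, cfg in EXPERIMENT_SETUP.items():
    print(f"{task}: epsilon={cfg['epsilon']}, K={cfg['K']}, epochs={cfg['epochs']}")
```

Note that ϵ is the only value that varies across task families; K and the epoch count are held fixed for all fine-tuning methods.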