Language Models as Implicit Tree Search

Authors: Ziliang Chen, Zhao-Rong Lai, Yufeng Yang, Liangda Fang, Zhanfu Yang, Liang Lin

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment, and MCTS-based LMs in mathematical reasoning and planning tasks." "Our experiments included the evaluation across human preference alignment, mathematical reasoning, and mathematical planning, where our approach concurrently reaped the optima against DPO variants and MCTS-derived baselines." "In this section, we demonstrate the superiority of IT-PO from the step-synchronous (Theorem 4.2) and step-asynchronous (Theorem 4.3) perspectives." |
| Researcher Affiliation | Academia | "1 Research Institute of Multiple Agents and Embodied Intelligence, Peng Cheng Laboratory; 2 Jinan University, Guangzhou, China; 3 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 4 Department of Computer Science, Rutgers University. Correspondence to: Liang Lin <EMAIL>." |
| Pseudocode | Yes | "Algorithm 1: The algorithm pipeline of IT-PO" |
| Open Source Code | No | The paper neither states that its code is open-sourced nor links to a code repository. |
| Open Datasets | Yes | "Anthropic HH dataset (Bai et al., 2022) consists of 170k dialogues..." (https://huggingface.co/datasets/Anthropic/hh-rlhf); "mathematical reasoning on GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and mathematical planning on Game24 (Cobbe et al., 2021)"; "Proof Writer (Tafjord et al., 2020) for deductive logical reasoning, and Chess Endgame (Abdulhai et al., 2023) for long-term multi-turn decision making." |
| Dataset Splits | Yes | "The experiment primarily aims for three evaluation metrics: 1) Accuracy: we adopt the evaluation split in (Zeng et al., 2024a) to train all models, then evaluate their performance in terms of accuracy on the generated responses relative to chosen completions in the test dataset." Table 5 (task setups) reports GSM8k (mathematical reasoning): 7.5k / 1.3k; Game24 (mathematical planning): 1.0k / 0.3k. "For Proof Writer, we follow (Pan et al.) to generate the test set, then the rest are merged to 41,433 training instances." |
| Hardware Specification | No | The paper names the base models used (e.g., Pythia 2.8, LLAMA-7b, Qwen1.5-32B) but gives no details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper discusses various algorithms and language models (e.g., DPO, cDPO, Pythia 2.8, LLAMA-7b) but lists no software dependencies with version numbers (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | "We set ϵ = 0 for the human alignment task, and ϵ = 0.25 for the mathematical reasoning and planning tasks." "We set K = 8, inspired from the number of sampled responses for each prompt in many RLHF implementations." "All fine-tuning methods only run for a single epoch." "Table 5. Task setups. The node, tree max width, and tree max depth are search space parameters." |
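The Experiment Setup row pins down a few concrete hyperparameters (ϵ per task family, K = 8 sampled responses, a single fine-tuning epoch). Collecting them in one place can be sketched as a minimal config; the container and key names (`EXPERIMENT_SETUP`, `epsilon`, `K`, `epochs`) are hypothetical illustrations, not identifiers from the paper:

```python
# Illustrative config summarizing the reported experiment setup.
# All names here are hypothetical; the paper specifies only the values:
# epsilon = 0 for human alignment, 0.25 for math reasoning/planning,
# K = 8 sampled responses per prompt, and one fine-tuning epoch throughout.
EXPERIMENT_SETUP = {
    "human_alignment":        {"epsilon": 0.0,  "K": 8, "epochs": 1},
    "mathematical_reasoning": {"epsilon": 0.25, "K": 8, "epochs": 1},
    "mathematical_planning":  {"epsilon": 0.25, "K": 8, "epochs": 1},
}

for task, cfg in EXPERIMENT_SETUP.items():
    print(f"{task}: epsilon={cfg['epsilon']}, K={cfg['K']}, epochs={cfg['epochs']}")
```

Note that ϵ is the only value that varies across task families; K and the epoch count are held fixed for all fine-tuning methods.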