TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Authors: Weibin Liao, Xu Chu, Yasha Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across five publicly available large language models on four datasets.
Researcher Affiliation Academia Weibin Liao, Xu Chu, Yasha Wang; School of Computer Science, Peking University; Center on Frontiers of Computing Studies, Peking University; National Research and Engineering Center of Software Engineering, Peking University
Pseudocode Yes Ultimately, the overall algorithm of TPO is detailed in Appendix Alg. 1.
Open Source Code Yes https://github.com/MrBlankness/TPO.git
Open Datasets Yes The dataset was derived from the MetaMath (Yu et al., 2023), MMIQC (Liu & Yao, 2024), and AQuA (Ling et al., 2017) datasets. We introduced three types of tasks, Math (in-distribution), Coding and Reasoning (out-of-distribution), to assess the effectiveness of TPO. For the Math tasks, we considered the following datasets: MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2021) and GSM-Plus (Li et al., 2024a). For the Coding tasks, we considered the HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) datasets. For the Reasoning tasks, we considered the BBH (Suzgun et al., 2023) and MMLU (Hendrycks et al.) datasets.
Dataset Splits No Lai et al. (2024) proposed a dataset that provides 10,795 paired preference data, completely composed of mathematical problems, with complete correct and incorrect reasoning trajectories provided for each problem... We collected 10 trajectories for each problem... We introduced three types of tasks, Math (in-distribution), Coding and Reasoning (out-of-distribution), to assess the effectiveness of TPO. For the Math tasks, we considered the following datasets: MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2021) and GSM-Plus (Li et al., 2024a). (Explanation: The paper describes the generation of training data and lists evaluation datasets, but it does not specify explicit training/validation/test splits, such as percentages or sample counts for its preference data, nor does it explicitly state the specific splits used for the listed benchmark evaluation datasets.)
Hardware Specification Yes The experiments were conducted on 8 NVIDIA-RTX3090-24GB GPUs.
Software Dependencies No We used the PyTorch library to implement all the algorithms based on the open-source Hugging Face transformers (Wolf, 2019) and Transformer Reinforcement Learning (TRL) (von Werra et al., 2020). (Explanation: The paper mentions PyTorch, Hugging Face transformers, and TRL but does not specify their version numbers.)
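Since the paper names PyTorch, Hugging Face transformers, and TRL without version numbers, anyone attempting a rerun should record the versions they actually install. A minimal sketch of such a check (the helper name `installed_versions` is mine, not the paper's; it uses the standard-library `importlib.metadata`):

```python
# The paper lists PyTorch, transformers, and TRL but gives no versions;
# this helper records whatever is installed so a rerun can pin them.
from importlib import metadata

def installed_versions(packages=("torch", "transformers", "trl")):
    """Return {package: version string, or None if not installed}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out
```

Dumping this dictionary alongside training logs would close the gap the review flags: the dependency list would then be version-pinned.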
Experiment Setup Yes For each experimental setup, we trained the model for 1 epoch, using a batch size of 1 for each GPU. The learning rate was set to 5e-7. The hyperparameter β used in Eq. 4 for DPO was set to 0.5. We utilized the AdamW optimizer and a cosine learning rate scheduler, with a warm-up ratio set to 0.1.
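The setup above pins down enough numbers to sketch the optimization. The snippet below is an illustration, not the authors' code: it reimplements a cosine learning-rate schedule with linear warm-up (matching the behavior of transformers' `get_cosine_schedule_with_warmup`) at the reported learning rate 5e-7 and warm-up ratio 0.1, plus the standard DPO objective with β = 0.5; the function names are mine.

```python
import math

BASE_LR = 5e-7      # learning rate reported in the setup
WARMUP_RATIO = 0.1  # warm-up ratio reported in the setup
BETA = 0.5          # DPO beta (Eq. 4) reported in the setup

def cosine_lr(step: int, total_steps: int) -> float:
    """Linear warm-up to BASE_LR, then cosine decay to zero."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def dpo_loss(logp_w: float, logp_l: float,
             ref_w: float, ref_l: float, beta: float = BETA) -> float:
    """Standard DPO loss: -log sigmoid of the beta-scaled log-ratio margin
    between the chosen (w) and rejected (l) responses, with reference
    log-probabilities ref_w and ref_l."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2, and it shrinks as the policy separates the chosen response from the rejected one faster than the reference does, which is the quantity β = 0.5 scales.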