TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
Authors: Weibin Liao, Xu Chu, Yasha Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across five publicly available large language models on four datasets. |
| Researcher Affiliation | Academia | Weibin Liao, Xu Chu, Yasha Wang; School of Computer Science, Peking University; Center on Frontiers of Computing Studies, Peking University; National Research and Engineering Center of Software Engineering, Peking University |
| Pseudocode | Yes | Ultimately, the overall algorithm of TPO is detailed in Appendix Alg. 1. |
| Open Source Code | Yes | https://github.com/MrBlankness/TPO.git |
| Open Datasets | Yes | The dataset was derived from the MetaMath (Yu et al., 2023), MMIQC (Liu & Yao, 2024), and AQuA (Ling et al., 2017) datasets. We introduced three types of tasks, Math (in-distribution), Coding and Reasoning (out-of-distribution), to assess the effectiveness of TPO. For the Math tasks, we considered the following datasets: MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2021) and GSM-Plus (Li et al., 2024a). For the Coding tasks, we considered the HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) datasets. For the Reasoning task, we considered the BBH (Suzgun et al., 2023) and MMLU (Hendrycks et al.) datasets. |
| Dataset Splits | No | Lai et al. (2024) proposed a dataset that provides 10,795 paired preference data, completely composed of mathematical problems, with complete correct and incorrect reasoning trajectories provided for each problem... We collected 10 trajectories for each problem... We introduced three types of tasks, Math (in-distribution), Coding and Reasoning (out-of-distribution), to assess the effectiveness of TPO. For the Math tasks, we considered the following datasets: MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2021) and GSM-Plus (Li et al., 2024a). (Explanation: The paper describes the generation of training data and lists evaluation datasets, but it does not specify explicit training/validation/test splits, such as percentages or sample counts for its preference data, nor does it explicitly state the specific splits used for the listed benchmark evaluation datasets.) |
| Hardware Specification | Yes | The experiments were conducted on 8 NVIDIA-RTX3090-24GB GPUs. |
| Software Dependencies | No | We used the PyTorch library to implement all the algorithms based on the open-source Hugging Face transformers (Wolf, 2019) and Transformer Reinforcement Learning (TRL) (von Werra et al., 2020). (Explanation: The paper mentions PyTorch, Hugging Face transformers, and TRL but does not specify their version numbers.) |
| Experiment Setup | Yes | For each experimental setup, we trained the model for 1 epoch, using a batch size of 1 for each GPU. The learning rate was set to 5e-7. The hyperparameter β used in Eq. 4 for DPO was set to 0.5. We utilized the AdamW optimizer and a cosine learning rate scheduler, with a warm-up ratio set to 0.1. |
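The reported training recipe (1 epoch, peak learning rate 5e-7, DPO β = 0.5, AdamW, cosine schedule with a 0.1 warm-up ratio) can be illustrated with a minimal plain-Python sketch. This is not the authors' code: the function names, the total step count, and the decay-to-zero floor are assumptions for demonstration; the DPO objective shown is the standard formulation that the β hyperparameter in Eq. 4 refers to.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
PEAK_LR = 5e-7
WARMUP_RATIO = 0.1
BETA = 0.5

def lr_at(step, total_steps):
    """Linear warm-up over the first 10% of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * WARMUP_RATIO))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=BETA):
    """Standard DPO objective: -log sigmoid(beta * margin of policy-vs-reference
    log-ratios between the chosen and rejected responses."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a hypothetical 1,000-step epoch, `lr_at(0, 1000)` is 0, `lr_at(100, 1000)` reaches the 5e-7 peak at the end of warm-up, and the rate decays toward 0 by the final step; a zero margin in `dpo_loss` gives the chance-level loss log 2.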