TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Authors: Weibin Liao, Xu Chu, Yasha Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across five publicly available large language models on four datasets.
Researcher Affiliation Academia Weibin Liao, Xu Chu, Yasha Wang; School of Computer Science, Peking University; Center on Frontiers of Computing Studies, Peking University; National Research and Engineering Center of Software Engineering, Peking University
Pseudocode Yes Ultimately, the overall algorithm of TPO is detailed in Appendix Alg. 1.
Open Source Code Yes https://github.com/MrBlankness/TPO.git
Open Datasets Yes The dataset was derived from the MetaMath (Yu et al., 2023), MMIQC (Liu & Yao, 2024), and AQuA (Ling et al., 2017) datasets. We introduced three types of tasks, Math (in-distribution), Coding and Reasoning (out-of-distribution), to assess the effectiveness of TPO. For the Math tasks, we considered the following datasets: MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2021) and GSM-Plus (Li et al., 2024a). For the Coding tasks, we considered the HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) datasets. For the Reasoning tasks, we considered the BBH (Suzgun et al., 2023) and MMLU (Hendrycks et al.) datasets.
Dataset Splits No Lai et al. (2024) proposed a dataset that provides 10,795 paired preference data, completely composed of mathematical problems, with complete correct and incorrect reasoning trajectories provided for each problem... We collected 10 trajectories for each problem... We introduced three types of tasks, Math (in-distribution), Coding and Reasoning (out-of-distribution), to assess the effectiveness of TPO. For the Math tasks, we considered the following datasets: MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2021) and GSM-Plus (Li et al., 2024a). (Explanation: The paper describes the generation of training data and lists evaluation datasets, but it does not specify explicit training/validation/test splits, such as percentages or sample counts for its preference data, nor does it explicitly state the specific splits used for the listed benchmark evaluation datasets.)
Hardware Specification Yes The experiments were conducted on 8 NVIDIA-RTX3090-24GB GPUs.
Software Dependencies No We used the PyTorch library to implement all the algorithms based on the open-source Hugging Face transformers (Wolf, 2019) and Transformer Reinforcement Learning (TRL) (von Werra et al., 2020). (Explanation: The paper mentions PyTorch, Hugging Face transformers, and TRL but does not specify their version numbers.)
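Since the paper names PyTorch, Hugging Face transformers, and TRL without version numbers, anyone attempting a rerun should record the versions they actually install. A minimal sketch of such a check (the helper name `installed_versions` is mine, not the paper's; it uses the standard-library `importlib.metadata`):

```python
# The paper lists PyTorch, transformers, and TRL but gives no versions;
# this helper records whatever is installed so a rerun can pin them.
from importlib import metadata

def installed_versions(packages=("torch", "transformers", "trl")):
    """Return {package: version string, or None if not installed}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out
```

Dumping this dictionary alongside training logs would close the gap the review flags: the dependency list would then be version-pinned.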
Experiment Setup Yes For each experimental setup, we trained the model for 1 epoch, using a batch size of 1 for each GPU. The learning rate was set to 5e-7. The hyperparameter β used in Eq. 4 for DPO was set to 0.5. We utilized the AdamW optimizer and a cosine learning rate scheduler, with a warm-up ratio set to 0.1.
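The setup above pins down enough numbers to sketch the optimization. The snippet below is an illustration, not the authors' code: it reimplements a cosine learning-rate schedule with linear warm-up (matching the behavior of transformers' `get_cosine_schedule_with_warmup`) at the reported learning rate 5e-7 and warm-up ratio 0.1, plus the standard DPO objective with β = 0.5; the function names are mine.

```python
import math

BASE_LR = 5e-7      # learning rate reported in the setup
WARMUP_RATIO = 0.1  # warm-up ratio reported in the setup
BETA = 0.5          # DPO beta (Eq. 4) reported in the setup

def cosine_lr(step: int, total_steps: int) -> float:
    """Linear warm-up to BASE_LR, then cosine decay to zero."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def dpo_loss(logp_w: float, logp_l: float,
             ref_w: float, ref_l: float, beta: float = BETA) -> float:
    """Standard DPO loss: -log sigmoid of the beta-scaled log-ratio margin
    between the chosen (w) and rejected (l) responses, with reference
    log-probabilities ref_w and ref_l."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2, and it shrinks as the policy separates the chosen response from the rejected one faster than the reference does, which is the quantity β = 0.5 scales.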