TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

Authors: Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on three instruction-following benchmarks: AlpacaEval 2 (Li et al., 2023), MT-Bench (Zheng et al., 2023), and Arena-Hard (Li et al., 2024). TGDPO consistently outperforms existing preference optimization algorithms, achieving improvements of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard compared to the best baseline method. [...] In this section, we first outline our experiment settings in Section 5.1. Then we show the main experiment results in Section 5.2. Lastly, we provide an empirical analysis of the unique properties of our TGDPO in Section 5.3.
Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong, 2The University of Hong Kong, 3Huawei, 4SmartMore, 5The Hong Kong University of Science and Technology. Correspondence to: Mingkang Zhu <EMAIL>, Jiaya Jia <EMAIL>.
Pseudocode | No | The paper contains mathematical derivations and descriptions of the methodology, but no section explicitly labeled 'Pseudocode' or 'Algorithm', and no structured, step-by-step procedure formatted as code.
Open Source Code | Yes | Code is available at https://github.com/dvlab-research/TGDPO.
Open Datasets | Yes | Extensive experiments are conducted on three instruction-following benchmarks: AlpacaEval 2 (Li et al., 2023), MT-Bench (Zheng et al., 2023), and Arena-Hard (Li et al., 2024). [...] Following (Meng et al., 2024), we use prompts from the UltraFeedback dataset (Cui et al., 2024) and let each model generate 5 responses with a temperature of 0.8.
Dataset Splits | No | Following (Meng et al., 2024), we use prompts from the UltraFeedback dataset (Cui et al., 2024) and let each model generate 5 responses with a temperature of 0.8. These responses are then ranked using the ArmoRM reward model (Wang et al., 2024). The highest- and lowest-ranked responses are selected as the chosen and rejected samples, respectively. [...] The paper describes how preference pairs are generated from the UltraFeedback dataset and evaluated on benchmarks, but it does not specify explicit training, validation, or test splits as percentages or counts for the main fine-tuning stage.
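The preference-pair construction quoted above (sample several responses per prompt, score them with a reward model, keep the best as "chosen" and the worst as "rejected") can be sketched as follows. This is a minimal illustration, not code from the TGDPO repository; the dummy scores stand in for ArmoRM reward-model outputs, and the sampling step (5 responses at temperature 0.8) is elided.

```python
# Sketch of preference-pair construction: given several sampled responses
# and their reward-model scores, keep the extremes as the (chosen, rejected)
# pair. Scores below are dummies standing in for ArmoRM rewards.

def build_preference_pair(responses, scores):
    """Return (chosen, rejected): the highest- and lowest-scored responses."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0])
    rejected = ranked[0][1]   # lowest reward
    chosen = ranked[-1][1]    # highest reward
    return chosen, rejected

# Example with 5 sampled responses and illustrative reward scores:
responses = [f"response_{i}" for i in range(5)]
scores = [0.31, 0.72, 0.18, 0.55, 0.64]
chosen, rejected = build_preference_pair(responses, scores)
# chosen -> "response_1", rejected -> "response_2"
```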
Hardware Specification | Yes | The training is conducted using 8 A100 GPUs.
Software Dependencies | No | The AdamW optimizer (Loshchilov & Hutter, 2019) is used. The paper mentions the AdamW optimizer but does not give version numbers for any key software components or libraries such as Python, PyTorch, or other relevant frameworks.
Experiment Setup | Yes | Following (Meng et al., 2024), we use a consistent batch size of 128 and train all methods for 1 epoch in all settings. The AdamW optimizer (Loshchilov & Hutter, 2019) is used. The max sequence length is set to 2048, and a cosine learning rate schedule with 10% warm-up steps is used. The hyperparameters for each method are grid-searched and are shown in Table 8 for DPO, Table 9 for SimPO, and Table 10 for our TGDPO, respectively.
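The reported setup can be collected into a small configuration sketch. The field names below are illustrative (they are not taken from the TGDPO codebase), and the example dataset size of 60,000 preference pairs is an assumption used only to show how the 10% warm-up fraction translates into optimizer steps.

```python
# Hedged sketch of the reported training configuration; field names are
# illustrative, not from the TGDPO repository.
training_config = {
    "global_batch_size": 128,   # consistent across all methods
    "num_epochs": 1,
    "optimizer": "AdamW",       # Loshchilov & Hutter, 2019
    "max_seq_length": 2048,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.10,       # 10% of total steps are warm-up
}

def warmup_steps(num_examples, cfg):
    """Warm-up steps implied by the warm-up ratio for a 1-epoch run."""
    total_steps = (num_examples // cfg["global_batch_size"]) * cfg["num_epochs"]
    return int(cfg["warmup_ratio"] * total_steps)

# Assumed dataset size of 60,000 pairs -> 468 total steps, 46 warm-up steps.
print(warmup_steps(60_000, training_config))
```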