TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

Authors: Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on three instruction-following benchmarks: AlpacaEval 2 (Li et al., 2023), MT-Bench (Zheng et al., 2023), and Arena-Hard (Li et al., 2024). TGDPO consistently outperforms existing preference optimization algorithms, achieving improvements of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard compared to the best baseline method. [...] In this section, we first outline our experiment settings in Section 5.1. Then we show the main experiment results in Section 5.2. Lastly, we provide an empirical analysis of the unique properties of our TGDPO in Section 5.3.
Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong, 2The University of Hong Kong, 3Huawei, 4SmartMore, 5The Hong Kong University of Science and Technology. Correspondence to: Mingkang Zhu <EMAIL>, Jiaya Jia <EMAIL>.
Pseudocode | No | The paper contains mathematical derivations and descriptions of the methodology, but no section explicitly labeled 'Pseudocode' or 'Algorithm', and no structured, step-by-step procedure formatted as code.
Open Source Code | Yes | Code is available at https://github.com/dvlab-research/TGDPO.
Open Datasets | Yes | Extensive experiments are conducted on three instruction-following benchmarks: AlpacaEval 2 (Li et al., 2023), MT-Bench (Zheng et al., 2023), and Arena-Hard (Li et al., 2024). [...] Following (Meng et al., 2024), we use prompts from the UltraFeedback dataset (Cui et al., 2024) and let each model generate 5 responses with a temperature of 0.8.
Dataset Splits | No | Following (Meng et al., 2024), we use prompts from the UltraFeedback dataset (Cui et al., 2024) and let each model generate 5 responses with a temperature of 0.8. These responses are then ranked using the ArmoRM reward model (Wang et al., 2024). The highest- and lowest-ranked responses are selected as the chosen and rejected samples, respectively. [...] The paper describes how preference pairs are generated from the UltraFeedback dataset and evaluated on benchmarks, but it does not specify explicit training, validation, or test splits as percentages or counts for the main fine-tuning stage.
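The preference-pair construction quoted above (sample several responses per prompt, score them with a reward model, keep the best as "chosen" and the worst as "rejected") can be sketched as follows. This is a minimal illustration, not code from the TGDPO repository; the dummy scores stand in for ArmoRM reward-model outputs, and the sampling step (5 responses at temperature 0.8) is elided.

```python
# Sketch of preference-pair construction: given several sampled responses
# and their reward-model scores, keep the extremes as the (chosen, rejected)
# pair. Scores below are dummies standing in for ArmoRM rewards.

def build_preference_pair(responses, scores):
    """Return (chosen, rejected): the highest- and lowest-scored responses."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0])
    rejected = ranked[0][1]   # lowest reward
    chosen = ranked[-1][1]    # highest reward
    return chosen, rejected

# Example with 5 sampled responses and illustrative reward scores:
responses = [f"response_{i}" for i in range(5)]
scores = [0.31, 0.72, 0.18, 0.55, 0.64]
chosen, rejected = build_preference_pair(responses, scores)
# chosen -> "response_1", rejected -> "response_2"
```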
Hardware Specification | Yes | The training is conducted using 8 A100 GPUs.
Software Dependencies | No | The AdamW optimizer (Loshchilov & Hutter, 2019) is used. The paper mentions the AdamW optimizer but does not give version numbers for any key software components or libraries such as Python, PyTorch, or other relevant frameworks.
Experiment Setup | Yes | Following (Meng et al., 2024), we use a consistent batch size of 128 and train all methods for 1 epoch in all settings. The AdamW optimizer (Loshchilov & Hutter, 2019) is used. The max sequence length is set to 2048, and a cosine learning rate schedule with 10% warm-up steps is used. The hyperparameters for each method are grid-searched and are shown in Table 8 for DPO, Table 9 for SimPO, and Table 10 for our TGDPO, respectively.
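The reported setup can be collected into a small configuration sketch. The field names below are illustrative (they are not taken from the TGDPO codebase), and the example dataset size of 60,000 preference pairs is an assumption used only to show how the 10% warm-up fraction translates into optimizer steps.

```python
# Hedged sketch of the reported training configuration; field names are
# illustrative, not from the TGDPO repository.
training_config = {
    "global_batch_size": 128,   # consistent across all methods
    "num_epochs": 1,
    "optimizer": "AdamW",       # Loshchilov & Hutter, 2019
    "max_seq_length": 2048,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.10,       # 10% of total steps are warm-up
}

def warmup_steps(num_examples, cfg):
    """Warm-up steps implied by the warm-up ratio for a 1-epoch run."""
    total_steps = (num_examples // cfg["global_batch_size"]) * cfg["num_epochs"]
    return int(cfg["warmup_ratio"] * total_steps)

# Assumed dataset size of 60,000 pairs -> 468 total steps, 46 warm-up steps.
print(warmup_steps(60_000, training_config))
```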