Self-Evolutionary Large Language Models Through Uncertainty-Enhanced Preference Optimization

Authors: Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments over multiple benchmarks demonstrate that our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization. Extensive experiments on two universal NLP benchmarks (i.e., Alpaca Eval 2.0 (Dubois et al. 2024) and MT-Bench (Zheng et al. 2023a)) and two mathematics reasoning tasks (i.e., GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021)) demonstrate that our UPO framework substantially enhances the effectiveness of preference alignment and achieves the best performance in auto evaluation. As shown in Table 1, the results of Alpaca Eval 2.0 denote the win rate compared to the reference generated by GPT-4.
Researcher Affiliation | Industry | Meituan
Pseudocode | Yes | The whole algorithm is shown in Algorithm 1 in Appendix B.
Open Source Code | No | The code will be released at https://github.com/wjn1996/Uncertainty-Preference-Optimization.
Open Datasets | Yes | We conduct extensive experiments on two universal NLP benchmarks (i.e., Alpaca Eval 2.0 (Dubois et al. 2024) and MT-Bench (Zheng et al. 2023a)) and two mathematics reasoning tasks (i.e., GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021)). The labeled preference data we used is Ultra Feedback (Cui et al. 2023), which consists of 61K prompts post-processed by Tunstall et al. (2023). We also select Ultra Chat200K as the prompt set. For the implementation, we choose Math Instruct (Yue et al. 2024) as the prompt set... The well-constructed fine-grained feedback data is Math-Step-DPO-10K, which involves 10.8K prompts with both coarse-grained and fine-grained annotation towards the answers.
Dataset Splits | Yes | We respectively sample 200 preference data from the validation set of Ultra Feedback, Alpaca Eval 2.0, and MATH-Step-DPO-10K to manually construct the evaluation set. For Alpaca Eval 2.0, we use the reference generated from GPT-4 as the preferred response, while the dispreferred response is created by the SFT model.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python, PyTorch, CUDA versions) needed to replicate the experiment.
Experiment Setup | Yes | We repeatedly train three models (i.e., LLM policy, reward, and estimator) for three iterations. For the implementation details, we use MC Dropout in BNN to estimate the information gain. Specifically, we enable dropout and repeat T (default set as 10) times to get independent and identically distributed (i.i.d.) predictions. More details of these benchmarks and hyper-parameters of each training iteration are listed in Appendix C.
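The setup above only states that MC Dropout with T = 10 stochastic forward passes is used to estimate information gain; the exact estimator is deferred to the paper's appendix. A minimal sketch of that general idea, assuming a BALD-style mutual-information estimate (predictive entropy minus expected per-pass entropy) and a hypothetical toy classifier standing in for the paper's estimator model:

```python
import torch
import torch.nn as nn

def mc_dropout_information_gain(model: nn.Module, x: torch.Tensor, T: int = 10) -> torch.Tensor:
    """Estimate information gain via MC Dropout: keep dropout layers active
    at inference time and aggregate T stochastic forward passes."""
    model.train()  # train mode keeps nn.Dropout active during the forward passes
    with torch.no_grad():
        # probs: (T, batch, num_classes), one softmax distribution per pass
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    eps = 1e-12  # numerical guard for log(0)
    mean_probs = probs.mean(dim=0)
    # Entropy of the averaged predictive distribution: (batch,)
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    # Average entropy of each individual stochastic prediction: (batch,)
    expected_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    # BALD-style information gain (mutual information), non-negative in theory
    return predictive_entropy - expected_entropy

# Hypothetical toy stand-in for the estimator network (not the paper's architecture)
toy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 2))
ig = mc_dropout_information_gain(toy, torch.randn(4, 8), T=10)  # one score per example
```

Examples whose T stochastic predictions disagree strongly get a large information gain, which is how such a score can flag noisy preference pairs for filtering.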