Self-Evolutionary Large Language Models Through Uncertainty-Enhanced Preference Optimization

Authors: Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments over multiple benchmarks demonstrate that our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization. Extensive experiments on two universal NLP benchmarks (i.e., Alpaca Eval 2.0 (Dubois et al. 2024) and MT-Bench (Zheng et al. 2023a)) and two mathematics reasoning tasks (i.e., GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021)) demonstrate that our UPO framework substantially enhances the effectiveness of preference alignment and achieves the best performance in auto evaluation. As shown in Table 1, the results of Alpaca Eval 2.0 denote the win rate compared to the reference generated by GPT-4.
Researcher Affiliation | Industry | Meituan
Pseudocode | Yes | The whole algorithm is shown in Algorithm 1 in Appendix B.
Open Source Code | No | The code will be released at https://github.com/wjn1996/Uncertainty-Preference-Optimization.
Open Datasets | Yes | We conduct extensive experiments on two universal NLP benchmarks (i.e., Alpaca Eval 2.0 (Dubois et al. 2024) and MT-Bench (Zheng et al. 2023a)) and two mathematics reasoning tasks (i.e., GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021)). The labeled preference data we used is Ultra Feedback (Cui et al. 2023), which consists of 61K prompts post-processed by Tunstall et al. (2023). We also select Ultra Chat200K as the prompt set. For the implementation, we choose Math Instruct (Yue et al. 2024) as the prompt set... The well-constructed fine-grained feedback data is Math-Step-DPO-10K, which involves 10.8K prompts with both coarse-grained and fine-grained annotation towards the answers.
Dataset Splits | Yes | We respectively sample 200 preference data from the validation set of Ultra Feedback, Alpaca Eval 2.0, and MATH-Step-DPO-10K to manually construct the evaluation set. For Alpaca Eval 2.0, we use the reference generated from GPT-4 as the preferred response, while the dispreferred response is created by the SFT model.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python, PyTorch, CUDA versions) needed to replicate the experiment.
Experiment Setup | Yes | We repeatedly train three models (i.e., LLM policy, reward, and estimator) for three iterations. For the implementation details, we use MC Dropout in BNN to estimate the information gain. Specifically, we enable dropout and repeat T (default set as 10) times to get independent and identically distributed (i.i.d.) predictions. More details of these benchmarks and hyper-parameters of each training iteration are listed in Appendix C.
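The setup above only states that MC Dropout with T = 10 stochastic forward passes is used to estimate information gain; the exact estimator is deferred to the paper's appendix. A minimal sketch of that general idea, assuming a BALD-style mutual-information estimate (predictive entropy minus expected per-pass entropy) and a hypothetical toy classifier standing in for the paper's estimator model:

```python
import torch
import torch.nn as nn

def mc_dropout_information_gain(model: nn.Module, x: torch.Tensor, T: int = 10) -> torch.Tensor:
    """Estimate information gain via MC Dropout: keep dropout layers active
    at inference time and aggregate T stochastic forward passes."""
    model.train()  # train mode keeps nn.Dropout active during the forward passes
    with torch.no_grad():
        # probs: (T, batch, num_classes), one softmax distribution per pass
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    eps = 1e-12  # numerical guard for log(0)
    mean_probs = probs.mean(dim=0)
    # Entropy of the averaged predictive distribution: (batch,)
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    # Average entropy of each individual stochastic prediction: (batch,)
    expected_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    # BALD-style information gain (mutual information), non-negative in theory
    return predictive_entropy - expected_entropy

# Hypothetical toy stand-in for the estimator network (not the paper's architecture)
toy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 2))
ig = mc_dropout_information_gain(toy, torch.randn(4, 8), T=10)  # one score per example
```

Examples whose T stochastic predictions disagree strongly get a large information gain, which is how such a score can flag noisy preference pairs for filtering.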