ROPO: Robust Preference Optimization for Large Language Models
Authors: Xize Liang, Chao Chen, Shuang Qiu, Jie Wang, Yue Wu, Zhihang Fu, Hanzhu Chen, Feng Wu, Jieping Ye
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several widely-used datasets and model architectures demonstrate that ROPO significantly outperforms all baselines under four practical noise settings and random symmetric noise, with its advantage increasing as the noise rate increases. Evaluation results on AlpacaEval, Arena-Hard, and MT-Bench show that the performance of ROPO remains stable in both practical and artificial noisy scenarios. |
| Researcher Affiliation | Collaboration | Xize Liang*1 Chao Chen*2 Shuang Qiu*3 Jie Wang1 Yue Wu2 Zhihang Fu2 Hanzhu Chen1 Feng Wu1 Jieping Ye2 — 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2Independent Researcher; 3City University of Hong Kong. Correspondence to: Jie Wang <EMAIL>. |
| Pseudocode | Yes | Please see Appendix A for the detailed description and pseudocode of the framework. Algorithm 1 ROPO |
| Open Source Code | No | The paper does not contain an unambiguous statement where the authors state they are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Tasks and Datasets. We focus on two dialogue datasets (i.e., UltraFeedback Binarized (UFB) and Alpaca Comparison (Peng et al., 2023)) and one post-summarization dataset (i.e., Reddit TL;DR (Völske et al., 2017; Stiennon et al., 2020)). UFB: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized |
| Dataset Splits | Yes | For models trained on TL;DR, we evaluate them by comparing their outputs with the SFT targets (chosen responses) on the test split of TL;DR. We randomly alter preference labels at different proportions (20% and 40%) within the datasets to produce more challenging symmetric noise (Gao et al., 2024). |
| Hardware Specification | Yes | We run all experiments on 16 NVIDIA A100 GPUs (80 GB). |
| Software Dependencies | No | The paper mentions several LLM models used (e.g., Mistral-7B, Llama-2-7B, GPT-4, text-davinci-003) but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | Unless otherwise noted, we use a global batch size of 512 to train all models. For all hyperparameters except for ε of label smoothing, we search for the best one on each dataset without artificial noise and use the same setting for 20% and 40% artificial noise. For all methods, we search the best learning rate in {1e-5, 5e-6, 1e-6, 5e-7, 1e-7} and the best β in {0.1, 0.5}. For ROPO, we use α = 14 and ρ = 0.2 in the main experiments. ...We set K = 3 for the rejection sampling. |
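The "Dataset Splits" row quotes the paper's symmetric-noise protocol: preference labels are randomly flipped for a fixed proportion (20% or 40%) of pairs. A minimal sketch of that kind of label flipping, not the authors' code (the function name `flip_labels` and the pair representation are my own assumptions):

```python
import random

def flip_labels(preferences, noise_rate, seed=0):
    """Inject symmetric preference noise: swap (chosen, rejected) for a
    random fraction `noise_rate` of the pairs. `preferences` is a list of
    (chosen, rejected) response tuples. Hypothetical helper, for illustration."""
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in preferences:
        if rng.random() < noise_rate:
            noisy.append((rejected, chosen))  # flipped label
        else:
            noisy.append((chosen, rejected))  # label kept
    return noisy
```

With `noise_rate=0.0` the data is returned unchanged; with `noise_rate=0.2` roughly one pair in five has its preference inverted, matching the 20% setting quoted above.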
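The "Experiment Setup" row describes a grid search over learning rates and β on the noise-free data, with the best setting reused under 20% and 40% noise. A sketch of enumerating that grid (variable names are illustrative, not from the paper):

```python
from itertools import product

# Search spaces quoted from the paper's setup description.
learning_rates = [1e-5, 5e-6, 1e-6, 5e-7, 1e-7]
betas = [0.1, 0.5]

# Fixed ROPO hyperparameters reported for the main experiments.
fixed = {"alpha": 14, "rho": 0.2, "global_batch_size": 512}

# All candidate configurations to evaluate on the clean dataset.
configs = [{"lr": lr, "beta": b, **fixed} for lr, b in product(learning_rates, betas)]
```

This yields 5 × 2 = 10 candidate configurations per dataset; per the quote, ε for the label-smoothing baseline is the only hyperparameter re-searched under artificial noise.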