ROPO: Robust Preference Optimization for Large Language Models
Authors: Xize Liang, Chao Chen, Shuang Qiu, Jie Wang, Yue Wu, Zhihang Fu, Hanzhu Chen, Feng Wu, Jieping Ye
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several widely-used datasets and model architectures demonstrate that ROPO significantly outperforms all baselines under four practical noise settings and random symmetric noise, with its advantage increasing as the noise rate increases. Evaluation results on AlpacaEval, Arena-Hard, and MT-Bench show that the performance of ROPO remains stable in both practical and artificial noisy scenarios. |
| Researcher Affiliation | Collaboration | Xize Liang*1 Chao Chen*2 Shuang Qiu*3 Jie Wang1 Yue Wu2 Zhihang Fu2 Hanzhu Chen1 Feng Wu1 Jieping Ye2 — 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2Independent Researcher; 3City University of Hong Kong. Correspondence to: Jie Wang <EMAIL>. |
| Pseudocode | Yes | Please see Appendix A for the detailed description and pseudocode of the framework. Algorithm 1 ROPO |
| Open Source Code | No | The paper does not contain an unambiguous statement where the authors state they are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Tasks and Datasets. We focus on two dialogue datasets (i.e., UltraFeedback Binarized (UFB) and Alpaca Comparison (Peng et al., 2023)) and one post-summarization dataset (i.e., Reddit TL;DR (Völske et al., 2017; Stiennon et al., 2020)). UFB: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized |
| Dataset Splits | Yes | For models trained on TL;DR, we evaluate them by comparing their outputs with the SFT targets (chosen responses) on the test split of TL;DR. We randomly alter preference labels at different proportions (20% and 40%) within the datasets to produce more challenging symmetric noise (Gao et al., 2024). |
| Hardware Specification | Yes | We run all experiments on 16 NVIDIA A100 GPUs (80 GB). |
| Software Dependencies | No | The paper mentions several LLM models used (e.g., Mistral-7B, Llama-2-7B, GPT-4, text-davinci-003) but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | Unless otherwise noted, we use a global batch size of 512 to train all models. For all hyperparameters except for ε of label smoothing, we search for the best one on each dataset without artificial noise and use the same setting for 20% and 40% artificial noise. For all methods, we search the best learning rate in {1e-5, 5e-6, 1e-6, 5e-7, 1e-7} and the best β in {0.1, 0.5}. For ROPO, we use α = 14 and ρ = 0.2 in the main experiments. ...We set K = 3 for the rejection sampling. |
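The "Dataset Splits" row quotes the paper's symmetric-noise protocol: preference labels are randomly flipped for a fixed proportion (20% or 40%) of pairs. A minimal sketch of that kind of label flipping, not the authors' code (the function name `flip_labels` and the pair representation are my own assumptions):

```python
import random

def flip_labels(preferences, noise_rate, seed=0):
    """Inject symmetric preference noise: swap (chosen, rejected) for a
    random fraction `noise_rate` of the pairs. `preferences` is a list of
    (chosen, rejected) response tuples. Hypothetical helper, for illustration."""
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in preferences:
        if rng.random() < noise_rate:
            noisy.append((rejected, chosen))  # flipped label
        else:
            noisy.append((chosen, rejected))  # label kept
    return noisy
```

With `noise_rate=0.0` the data is returned unchanged; with `noise_rate=0.2` roughly one pair in five has its preference inverted, matching the 20% setting quoted above.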
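The "Experiment Setup" row describes a grid search over learning rates and β on the noise-free data, with the best setting reused under 20% and 40% noise. A sketch of enumerating that grid (variable names are illustrative, not from the paper):

```python
from itertools import product

# Search spaces quoted from the paper's setup description.
learning_rates = [1e-5, 5e-6, 1e-6, 5e-7, 1e-7]
betas = [0.1, 0.5]

# Fixed ROPO hyperparameters reported for the main experiments.
fixed = {"alpha": 14, "rho": 0.2, "global_batch_size": 512}

# All candidate configurations to evaluate on the clean dataset.
configs = [{"lr": lr, "beta": b, **fixed} for lr, b in product(learning_rates, betas)]
```

This yields 5 × 2 = 10 candidate configurations per dataset; per the quote, ε for the label-smoothing baseline is the only hyperparameter re-searched under artificial noise.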