Indirect Online Preference Optimization via Reinforcement Learning
Authors: En Wang, Xingyu Lin, Du Su, Chenfu Bao, Zhonghou Lv, Funing Yang, Yuanbo Xu, Wenbin Liu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, evidenced by standard alignment metrics and human evaluations. |
| Researcher Affiliation | Collaboration | 1. College of Computer Science and Technology, Jilin University; 2. Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University; 3. State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 4. Baidu Inc. |
| Pseudocode | Yes | Algorithm 1 Indirect Online Preference Optimization |
| Open Source Code | No | The paper mentions where the code for DPO, RSO, IPO, and KTO-PAIR can be found (Hugging Face), but does not provide any explicit statement or link for the open-source code of their proposed IOPO methodology. |
| Open Datasets | Yes | The datasets can be found on Hugging Face at https://hf-mirror.com/datasets/argilla/distilabel-capybara-dpo-7k-binarized and https://hf-mirror.com/datasets/mlabonne/orpo-dpo-mix-40k |
| Dataset Splits | Yes | For better online performance evaluation, we split the safe world view dataset into two parts: 40k and 4k, using the 4k set for testing margin reward and pairwise accuracy. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. It only mentions using 'Baichuan2-7B (base/chat)' which is a model, not hardware. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for their implementation. It mentions external tools and models like ERNIE-3.5 and ERNIE-4.0, and refers to Hugging Face documentation for other methods, but not their own software stack. |
| Experiment Setup | Yes | Training: learning rate lr = 2e-7, batch size = 32, β = 0.1. IOPO matches DPO's sensitivity to hyperparameters except for clipping, and is robust to ϵ (set to 0.3). |
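The reported setup (β = 0.1, clipping threshold ϵ = 0.3) suggests a DPO-style pairwise objective combined with PPO-style clipping. The paper's exact formulation is not reproduced here, so the following is only a minimal numeric sketch: a Bradley–Terry/DPO loss in which each policy-to-reference log-ratio is clipped to [log(1−ϵ), log(1+ϵ)], an assumed reading of how ϵ enters the objective. The function name and the clipping placement are illustrative, not the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def clipped_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, eps=0.3):
    """DPO-style pairwise loss with PPO-style clipping of the
    per-response log-ratios (hypothetical reading of IOPO's eps)."""
    # Log-ratio of policy to reference for chosen (w) / rejected (l) responses.
    r_w = logp_w - ref_logp_w
    r_l = logp_l - ref_logp_l
    # Clip each log-ratio to [log(1-eps), log(1+eps)], as in PPO's ratio clip.
    lo, hi = math.log(1.0 - eps), math.log(1.0 + eps)
    r_w = min(max(r_w, lo), hi)
    r_l = min(max(r_l, lo), hi)
    # Standard Bradley-Terry / DPO objective on the clipped margin.
    return -math.log(sigmoid(beta * (r_w - r_l)))
```

With the reported β = 0.1 the loss at zero margin is log 2 ≈ 0.693, and it shrinks as the chosen response's log-ratio rises above the rejected one's; once either log-ratio exceeds the ϵ band, further movement no longer changes the loss, which is the clipping behavior the ϵ = 0.3 setting controls.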