Indirect Online Preference Optimization via Reinforcement Learning

Authors: En Wang, Xingyu Lin, Du Su, Chenfu Bao, Zhonghou Lv, Funing Yang, Yuanbo Xu, Wenbin Liu

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, evidenced by standard alignment metrics and human evaluations.
Researcher Affiliation | Collaboration | 1 College of Computer Science and Technology, Jilin University; 2 Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University; 3 State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 4 Baidu Inc.
Pseudocode | Yes | Algorithm 1: Indirect Online Preference Optimization
Open Source Code | No | The paper notes where the code for DPO, RSO, IPO, and KTO-PAIR can be found (Hugging Face), but provides no explicit statement or link for open-source code of the proposed IOPO method.
Open Datasets | Yes | (1) https://hf-mirror.com/datasets/argilla/distilabel-capybara-dpo-7k-binarized (2) https://hf-mirror.com/datasets/mlabonne/orpo-dpo-mix-40k
Dataset Splits | Yes | For better online performance evaluation, we split the safe world view dataset into two parts: 40k and 4k, using the 4k set for testing margin reward and pairwise accuracy.
Hardware Specification | No | The paper gives no details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. It only names 'Baichuan2-7B (base/chat)', which is a model, not hardware.
Software Dependencies | No | The paper lists no software dependencies with version numbers for its own implementation. It mentions external models such as ERNIE-3.5 and ERNIE-4.0 and refers to Hugging Face documentation for the baseline methods, but not its own software stack.
Experiment Setup | Yes | Training: learning rate lr = 2e-7, batch size = 32, β = 0.1. IOPO matches DPO's sensitivity to hyperparameters except for clipping, and is robust to ε (set to 0.3).
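Since the setup row reports β = 0.1 and states that IOPO behaves like DPO for all hyperparameters except clipping (with ε = 0.3), a minimal sketch of a DPO-style pairwise preference loss using those values may help make the setup concrete. This is not the paper's implementation: the function name and inputs are assumptions, and clamping the implicit-reward margin by ε is only one plausible reading of where IOPO's clipping applies.

```python
import math

def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, eps=0.3):
    """DPO-style pairwise preference loss with the hyperparameters
    reported above (beta = 0.1, eps = 0.3).

    Where IOPO applies the clipping threshold eps is not specified
    here; clamping the implicit-reward margin is an assumption.
    """
    # Implicit reward margin between the chosen and rejected responses,
    # measured as policy-vs-reference log-probability ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Illustrative PPO-style clamp of the margin to [-eps, eps].
    margin = max(min(margin, eps), -eps)
    # Negative log-sigmoid of the margin (standard DPO objective).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy prefers the chosen response more strongly than
# the reference model does, so the loss falls below log(2).
loss = dpo_style_loss(-1.0, -2.0, -1.2, -1.8, beta=0.1, eps=0.3)
```

With these example log-probabilities the margin is 0.1 × 0.4 = 0.04, well inside the clamp, so the clipping is inactive and the loss reduces to the plain DPO objective.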