Indirect Online Preference Optimization via Reinforcement Learning

Authors: En Wang, Xingyu Lin, Du Su, Chenfu Bao, Zhonghou Lv, Funing Yang, Yuanbo Xu, Wenbin Liu

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, evidenced by standard alignment metrics and human evaluations.
Researcher Affiliation | Collaboration | 1 College of Computer Science and Technology, Jilin University; 2 Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University; 3 State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 4 Baidu Inc.
Pseudocode | Yes | Algorithm 1: Indirect Online Preference Optimization
Open Source Code | No | The paper notes where the code for DPO, RSO, IPO, and KTO-PAIR can be found (Hugging Face), but provides no explicit statement or link for open-source code of the proposed IOPO method.
Open Datasets | Yes | (1) https://hf-mirror.com/datasets/argilla/distilabel-capybara-dpo-7k-binarized (2) https://hf-mirror.com/datasets/mlabonne/orpo-dpo-mix-40k
Dataset Splits | Yes | For better online performance evaluation, we split the safe world view dataset into two parts: 40k and 4k, using the 4k set for testing margin reward and pairwise accuracy.
Hardware Specification | No | The paper gives no details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. It only names 'Baichuan2-7B (base/chat)', which is a model, not hardware.
Software Dependencies | No | The paper lists no software dependencies with version numbers for its own implementation. It mentions external models such as ERNIE-3.5 and ERNIE-4.0 and refers to Hugging Face documentation for the baseline methods, but not its own software stack.
Experiment Setup | Yes | Training: learning rate lr = 2e-7, batch size = 32, β = 0.1. IOPO matches DPO's sensitivity to hyperparameters except for clipping, and is robust to ε (set to 0.3).
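Since the setup row reports β = 0.1 and states that IOPO behaves like DPO for all hyperparameters except clipping (with ε = 0.3), a minimal sketch of a DPO-style pairwise preference loss using those values may help make the setup concrete. This is not the paper's implementation: the function name and inputs are assumptions, and clamping the implicit-reward margin by ε is only one plausible reading of where IOPO's clipping applies.

```python
import math

def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, eps=0.3):
    """DPO-style pairwise preference loss with the hyperparameters
    reported above (beta = 0.1, eps = 0.3).

    Where IOPO applies the clipping threshold eps is not specified
    here; clamping the implicit-reward margin is an assumption.
    """
    # Implicit reward margin between the chosen and rejected responses,
    # measured as policy-vs-reference log-probability ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Illustrative PPO-style clamp of the margin to [-eps, eps].
    margin = max(min(margin, eps), -eps)
    # Negative log-sigmoid of the margin (standard DPO objective).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy prefers the chosen response more strongly than
# the reference model does, so the loss falls below log(2).
loss = dpo_style_loss(-1.0, -2.0, -1.2, -1.8, beta=0.1, eps=0.3)
```

With these example log-probabilities the margin is 0.1 × 0.4 = 0.04, well inside the clamp, so the clipping is inactive and the loss reduces to the plain DPO objective.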