Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
Authors: Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experiments to demonstrate the effectiveness of the BSPO algorithm. Specifically, we focus on the following three aspects: (1) BSPO outperforms baseline algorithms, demonstrating its capacity to better mitigate reward over-optimization and find the optimal in-distribution policies (Section 5.2); (2) BSPO reduces the generation of OOD responses during RL training, thereby avoiding overestimation caused by the extrapolation errors of the reward prediction (Section 5.3); (3) BSPO effectively avoids over-optimization at larger KL divergence distances (Section 5.4). |
| Researcher Affiliation | Academia | Juntao Dai^{1,2}, Taiye Chen^{4}, Yaodong Yang^{3,4}, Qian Zheng^{1,2}, Gang Pan^{1,2} — 1 College of Computer Science and Technology, Zhejiang University; 2 The State Key Lab of Brain-Machine Intelligence, Zhejiang University; 3 LLM Safety Centre, Beijing Academy of Artificial Intelligence; 4 Center for AI Safety and Governance, Peking University. Email: yeyutaihan@stu.pku.edu.cn, yaodong.yang@pku.edu.cn |
| Pseudocode | Yes | B.2 PSEUDO-CODE We provide the pseudo-code of the implementation of our BSPO algorithm as follows: Algorithm 1 Behavior-Supported Policy Optimization |
| Open Source Code | No | The paper does not contain any explicit statements about open-sourcing the code, nor does it provide links to a code repository or indicate that code is available in supplementary materials for the methodology described in this paper. |
| Open Datasets | Yes | The gold model is trained using 57k preference pairs from the binarized UltraFeedback dataset (Cui et al., 2023). ... We also applied our pipeline to the Alpaca Farm dataset (Dubois et al., 2024b). |
| Dataset Splits | No | The paper mentions training data volume (e.g., 57k preference pairs, 30k data points) and evaluation on a 'test set', but it does not provide specific train/validation/test split percentages or sample counts to reproduce the data partitioning. |
| Hardware Specification | Yes | All experiments in this paper utilize the following runtime environment. The server's CPU is an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz with 128 cores, and the graphics cards are NVIDIA A100-SXM4-80GB ×8, with NVLink support and the graphics driver version being 550.54.15. |
| Software Dependencies | No | The paper mentions a graphics driver version (550.54.15) but does not provide specific version numbers for other key software components, programming languages, or libraries used in the experimental setup. |
| Experiment Setup | Yes | In this section, we provide all the hyper-parameters used in our experiments. Table 1: Hyper-parameters of three experimental settings of BSPO. Table 2: Hyper-parameters of Reward Model Training. |
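The core idea reported in the table — suppressing reward-model overestimation by refusing to trust value predictions for responses outside the behavior (SFT) distribution — can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function name, the log-probability threshold, and the `v_min` fallback are assumptions introduced here to show the behavior-supported idea of replacing extrapolated estimates for OOD actions with a conservative minimum.

```python
def behavior_supported_value(q_values, behavior_logprobs, log_threshold, v_min):
    """Conservative value filter (illustrative sketch, not the paper's code).

    For each candidate action (e.g. a generated token), keep the learned value
    estimate only if the behavior policy assigns it at least `log_threshold`
    log-probability; otherwise fall back to the minimum value `v_min`, so the
    reward model's extrapolation on out-of-distribution actions cannot inflate
    the policy-optimization target.
    """
    return [q if lp >= log_threshold else v_min
            for q, lp in zip(q_values, behavior_logprobs)]


# Hypothetical usage: the second action is far outside the behavior support
# (log-prob -9.0 < threshold -5.0), so its optimistic estimate is clamped.
filtered = behavior_supported_value(
    q_values=[1.0, 2.0, 3.0],
    behavior_logprobs=[-0.5, -9.0, -2.0],
    log_threshold=-5.0,
    v_min=0.0,
)
```

In this sketch the penalty is a hard support test; the paper's Algorithm 1 (Appendix B.2) defines the precise behavior-supported regularization used during RL.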