Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
Authors: Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experiments to demonstrate the effectiveness of the BSPO algorithm. Specifically, we focus on the following three aspects: (1) BSPO outperforms baseline algorithms, demonstrating its capacity to better mitigate reward over-optimization and find the optimal in-distribution policies (Section 5.2); (2) BSPO reduces the generation of OOD responses during RL training, thereby avoiding overestimation caused by the extrapolation errors of the reward prediction (Section 5.3); (3) BSPO effectively avoids over-optimization at larger KL divergence distances (Section 5.4). |
| Researcher Affiliation | Academia | Juntao Dai^{1,2}, Taiye Chen^{4}, Yaodong Yang^{3,4}, Qian Zheng^{1,2}, Gang Pan^{1,2} — 1 College of Computer Science and Technology, Zhejiang University; 2 The State Key Lab of Brain-Machine Intelligence, Zhejiang University; 3 LLM Safety Centre, Beijing Academy of Artificial Intelligence; 4 Center for AI Safety and Governance, Peking University. Email: yeyutaihan@stu.pku.edu.cn, yaodong.yang@pku.edu.cn |
| Pseudocode | Yes | B.2 PSEUDO-CODE We provide the pseudo-code of the implementation of our BSPO algorithm as follows: Algorithm 1 Behavior-Supported Policy Optimization |
| Open Source Code | No | The paper does not contain any explicit statements about open-sourcing the code, nor does it provide links to a code repository or indicate that code is available in supplementary materials for the methodology described in this paper. |
| Open Datasets | Yes | The gold model is trained using 57k preference pairs from the binarized UltraFeedback dataset (Cui et al., 2023). ... We also applied our pipeline to the Alpaca Farm dataset (Dubois et al., 2024b). |
| Dataset Splits | No | The paper mentions training data volume (e.g., 57k preference pairs, 30k data points) and evaluation on a 'test set', but it does not provide specific train/validation/test split percentages or sample counts to reproduce the data partitioning. |
| Hardware Specification | Yes | All experiments in this paper utilize the following runtime environment. The server's CPU is an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz with 128 cores, and the graphics cards are NVIDIA A100-SXM4-80GB ×8, with NVLink support and the graphics driver version being 550.54.15. |
| Software Dependencies | No | The paper mentions a graphics driver version (550.54.15) but does not provide specific version numbers for other key software components, programming languages, or libraries used in the experimental setup. |
| Experiment Setup | Yes | In this section, we provide all the hyper-parameters used in our experiments. Table 1: Hyper-parameters of three experimental settings of BSPO. Table 2: Hyper-parameters of Reward Model Training. |
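The core idea reported in the table — suppressing reward-model overestimation by refusing to trust value predictions for responses outside the behavior (SFT) distribution — can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function name, the log-probability threshold, and the `v_min` fallback are assumptions introduced here to show the behavior-supported idea of replacing extrapolated estimates for OOD actions with a conservative minimum.

```python
def behavior_supported_value(q_values, behavior_logprobs, log_threshold, v_min):
    """Conservative value filter (illustrative sketch, not the paper's code).

    For each candidate action (e.g. a generated token), keep the learned value
    estimate only if the behavior policy assigns it at least `log_threshold`
    log-probability; otherwise fall back to the minimum value `v_min`, so the
    reward model's extrapolation on out-of-distribution actions cannot inflate
    the policy-optimization target.
    """
    return [q if lp >= log_threshold else v_min
            for q, lp in zip(q_values, behavior_logprobs)]


# Hypothetical usage: the second action is far outside the behavior support
# (log-prob -9.0 < threshold -5.0), so its optimistic estimate is clamped.
filtered = behavior_supported_value(
    q_values=[1.0, 2.0, 3.0],
    behavior_logprobs=[-0.5, -9.0, -2.0],
    log_threshold=-5.0,
    v_min=0.0,
)
```

In this sketch the penalty is a hard support test; the paper's Algorithm 1 (Appendix B.2) defines the precise behavior-supported regularization used during RL.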