RuleAdapter: Dynamic Rules for Training Safety Reward Models in RLHF
Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on Reward Bench. As of May 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models. We conduct experiments to verify that the reward model trained with the Rule Adapter achieves superior safety performance, leading the Reward Bench leaderboard. We implement a complete RLHF process using PPO with our trained reward model RAMO, showcasing significantly improved safety performance of the aligned policy. |
| Researcher Affiliation | Academia | 1Harvard University. 2Massachusetts Institute of Technology. 3Pennsylvania State University. Correspondence to: Xiaomin Li <EMAIL>. |
| Pseudocode | No | The paper describes methods and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks with structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | We release the rule pool, the synthetic safety preference dataset, the Rule Adapter, and the trained reward model RAMO, contributing valuable resources for further study 1. 1The datasets and models will be released once the paper review process is complete. Our code is available at: https://anonymous.4open.science/r/Dynamic Rules-7F5E/ |
| Open Datasets | Yes | To achieve this, we applied our approach to HH-RLHF (Anthropic, 2022), a commonly used preference dataset for safety alignment. For each trio in the datasets we first identify the 5 most critical rules using the Rule Adapter. |
| Dataset Splits | No | The paper mentions training on '1K data' for the reward model and using '1K, 2K, 5K' data sizes from HH-RLHF for a generalization study, stating 'the data for training the reward model is randomly selected from the whole dataset with 2 seeds'. However, it does not explicitly provide specific percentages, counts, or a detailed methodology for how these datasets are split into training, validation, or test sets. |
| Hardware Specification | No | The paper mentions 'high GPU requirements for accommodating both the reward model and policy during PPO' but does not specify any particular GPU models, CPU types, or other hardware specifications used for running their experiments. |
| Software Dependencies | No | The paper mentions using specific LLM architectures like 'Llama3.1-8B' and 'Llama3.2-3B' as base models for training. However, it does not provide a list of ancillary software dependencies (e.g., programming languages, libraries, frameworks) along with their specific version numbers. |
| Experiment Setup | Yes | RAMO is trained on the 1K data for 2 epochs with a learning rate of 2×10⁻⁵. For consistency and based on empirical evidence, we set r = 5 for all experiments. During inference, the temperature and top_p are set to 0.6 and 0.9 to ensure the diversity of the generated responses; max_new_token is set to 256 to avoid overly long responses. Several γ values were explored in Table 7; we eventually chose γ = 2. We tried different combinations of learning rate, training epochs, and size of the data. |
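For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is a hypothetical illustration only (the variable names and structure are ours, not the paper's); the numeric values are the ones reported above.

```python
# Hypothetical configuration sketch of the reported experiment setup.
# Names are illustrative; values are taken from the Experiment Setup row.

reward_model_training = {
    "train_size": 1000,       # "1K data" used to train RAMO
    "epochs": 2,
    "learning_rate": 2e-5,    # 2×10⁻⁵
    "top_rules_r": 5,         # r = 5 rules selected per trio by the Rule Adapter
}

inference_settings = {
    "temperature": 0.6,       # diversity of generated responses
    "top_p": 0.9,
    "max_new_tokens": 256,    # cap on response length
}

gamma = 2  # chosen after exploring several values (Table 7 of the paper)
```

A sketch like this makes it easy to spot which knobs the authors swept (learning rate, epochs, data size, γ) versus which they fixed across all experiments (r = 5, the inference sampling settings).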