RuleAdapter: Dynamic Rules for Training Safety Reward Models in RLHF

Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of May 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models." "We conduct experiments to verify that the reward model trained with the Rule Adapter achieves superior safety performance, leading the RewardBench leaderboard." "We implement a complete RLHF process using PPO with our trained reward model RAMO, showcasing significantly improved safety performance of the aligned policy."
Researcher Affiliation | Academia | "Harvard University; Massachusetts Institute of Technology; Pennsylvania State University. Correspondence to: Xiaomin Li <EMAIL>."
Pseudocode | No | "The paper describes methods and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks with structured, step-by-step procedures in a code-like format."
Open Source Code | Yes | "We release the rule pool, the synthetic safety preference dataset, the Rule Adapter, and the trained reward model RAMO, contributing valuable resources for further study." Footnote: "The datasets and models will be released once the paper review process is complete. Our code is available at: https://anonymous.4open.science/r/Dynamic Rules-7F5E/"
Open Datasets | Yes | "To achieve this, we applied our approach to HH-RLHF (Anthropic, 2022), a commonly used preference dataset for safety alignment." "For each trio in the dataset, we first identify the 5 most critical rules using the Rule Adapter."
Dataset Splits | No | The paper mentions training on "1K data" for the reward model and using 1K, 2K, and 5K data sizes from HH-RLHF for a generalization study, stating "the data for training the reward model is randomly selected from the whole dataset with 2 seeds". However, it does not explicitly provide percentages, counts, or a methodology for splitting these data into training, validation, and test sets.
Hardware Specification | No | The paper mentions "high GPU requirements for accommodating both the reward model and policy during PPO" but does not specify any particular GPU models, CPU types, or other hardware used for the experiments.
Software Dependencies | No | The paper mentions using specific LLM architectures such as Llama3.1-8B and Llama3.2-3B as base models for training. However, it does not list ancillary software dependencies (e.g., programming languages, libraries, frameworks) with specific version numbers.
Experiment Setup | Yes | "RAMO is trained on the 1K data for 2 epochs with a learning rate of 2 × 10⁻⁵." "For consistency and based on empirical evidence, we set r = 5 for all experiments." "During inference, the temperature and top_p are set to 0.6 and 0.9 to ensure the diversity of the generated responses; max_new_token is set to 256 to avoid too long responses." "Several γ values were explored in Table 7; we eventually chose γ = 2." "We tried different combinations of learning rate, training epochs, and size of the data."
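The setup quoted above (select the r = 5 most critical rules per preference trio, then train RAMO with the reported hyperparameters) can be illustrated with a minimal sketch. This is not the authors' code: `score_rule` is a toy stand-in for the Rule Adapter's per-rule criticality scoring, and the dict names are ours, not the paper's.

```python
# Minimal sketch (hypothetical, not the authors' implementation).

def score_rule(rule: str, prompt: str) -> float:
    """Toy stand-in for the Rule Adapter's criticality score:
    word overlap between the rule text and the prompt."""
    rule_words = set(rule.lower().split())
    prompt_words = set(prompt.lower().split())
    return len(rule_words & prompt_words) / max(len(rule_words), 1)

def select_rules(rule_pool: list[str], prompt: str, r: int = 5) -> list[str]:
    """Keep the r highest-scoring rules for this prompt (r = 5 in the paper)."""
    return sorted(rule_pool, key=lambda rule: score_rule(rule, prompt),
                  reverse=True)[:r]

# Hyperparameters quoted in the table above.
reward_model_training = {
    "train_size": 1_000,    # "1K data"
    "epochs": 2,
    "learning_rate": 2e-5,  # 2 × 10⁻⁵
    "gamma": 2,             # chosen after exploring several values (Table 7)
}
generation_config = {
    "temperature": 0.6,
    "top_p": 0.9,
    "max_new_tokens": 256,
}
```

In the paper the criticality scoring is done by the trained Rule Adapter model itself; the word-overlap scorer here only makes the top-r selection step concrete.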