On the Robustness of Reward Models for Language Model Alignment

Authors: Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce a zero-centered reward sum per batch, constraining rewards with extreme magnitudes. We assess the impact of BSR in improving the robustness of RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model.
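The BT-plus-BSR objective described above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the exact penalty form (here, a squared penalty on the per-batch reward sum, weighted by λ) and the function name are assumptions for the sketch.

```python
import math

def bt_bsr_loss(chosen_rewards, rejected_rewards, lam=1e-3):
    """Bradley-Terry loss plus a batch sum-to-zero regularizer (BSR).

    The squared-sum penalty below is an illustrative assumption of how
    "zero-centered reward sum per batch" could be enforced; consult the
    paper for the exact formulation.
    """
    n = len(chosen_rewards)
    # Standard BT objective: -log sigmoid(r_chosen - r_rejected), batch mean.
    bt = sum(
        -math.log(1.0 / (1.0 + math.exp(-(rc - rr))))
        for rc, rr in zip(chosen_rewards, rejected_rewards)
    ) / n
    # BSR term: push the sum of all rewards in the batch toward zero,
    # discouraging rewards with extreme magnitudes.
    batch_sum = sum(chosen_rewards) + sum(rejected_rewards)
    return bt + lam * batch_sum ** 2
```

With λ = 0 this reduces to the plain BT loss; larger λ penalizes batches whose rewards drift away from a zero-centered sum.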
Researcher Affiliation | Collaboration | ¹KAIST AI, ²OneLineAI, ³LinkedIn Corporation. Correspondence to: Jiwoo Hong <jiwoo EMAIL>.
Pseudocode | No | The paper includes mathematical equations and descriptions of methods, but does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 4.2, 'Method: Batch sum-to-zero regularization (BSR)', describes the method through text and equations without a structured algorithm block.
Open Source Code | Yes | We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
Open Datasets | Yes | We adopt UltraFeedback (Cui et al., 2025, UF), which harnesses 17 different models from varying model families to sample four responses per prompt, to set one train set and four validation sets as described in Section 2.2. Finally, we extend our experiments in Sections 3.1 and 3.2 to the 8B model and high-quality synthetic preference data to demonstrate the scalability and effectiveness of the proposed method. Models and datasets: We use the TULU3 SFT mixture (Lambert et al., 2024) to conduct SFT on Qwen2.5-1.5B. Then, we employ Llama-3.1-8B-based RMs trained with L_BT and L_BT-BSR on Skywork-Reward-Preference-80K-v0.2, a high-quality synthetic preference dataset (Liu et al., 2024a; 2025b). For RM_BT, we use the official checkpoint of Liu et al. (2024a). When r is not available, we evaluate RMs through RM-Bench (Liu et al., 2025b), which provides preference pairs with subtle differences and achieves higher correlation against actual use cases than RewardBench (Lambert et al., 2025).
Dataset Splits | Yes | We adopt UltraFeedback (Cui et al., 2025, UF), which harnesses 17 different models from varying model families to sample four responses per prompt, to set one train set and four validation sets as described in Section 2.2.
Train set (D_train): First, we select a random set of 51,200 samples from UF as the train set. Then, we choose two random responses out of four for each prompt in the train set. Thereby, we have 51,200 triplets comprising a prompt and the corresponding chosen and rejected responses, labeled according to ArmoRM.
Validation 1, in-domain (D_ID): We use the remaining two responses for the 51,200 prompts in the train set as the in-domain validation set to evaluate whether the trained reward models generalize within the same prompt and response spaces. Since we have two responses per prompt, we use binary accuracy as the evaluation metric.
Validation 2, prompt out-of-domain (D_Prompt-OOD): We set the remaining 12,800 instances as a prompt out-of-domain (Prompt OOD) validation set, which has a different prompt space but the same response space.
Validation 3, response out-of-domain (D_Response-OOD): For the prompts in the train set, we additionally generate four different responses from new models: Gemma-2-2B-It (Team et al., 2024), OLMo2-7B-Instruct (OLMo et al., 2025), SmolLM2-1.7B-Instruct (Allal et al., 2025), and Mistral-Instruct-v0.2 (Jiang et al., 2023a).
Validation 4, mutual out-of-domain (D_Mutual-OOD): Using the same models as in the Response OOD set, we generate responses for the prompts from the Prompt OOD set, yielding the mutual out-of-domain (Mutual OOD) validation set.
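The train / in-domain / prompt-OOD split construction described above can be sketched as follows. The helper name and record layout are illustrative; the chosen/rejected labeling by ArmoRM scores, and the Response-OOD and Mutual-OOD regeneration with new models, are omitted.

```python
import random

def make_splits(uf, n_train=51_200, n_prompt_ood=12_800, seed=0):
    """Sketch of the UF split scheme (names and structure are illustrative).

    `uf` is a list of dicts: {"prompt": str, "responses": [r0, r1, r2, r3]},
    mimicking UltraFeedback's four responses per prompt.
    """
    rng = random.Random(seed)
    data = uf[:]
    rng.shuffle(data)
    train_pool = data[:n_train]
    prompt_ood = data[n_train:n_train + n_prompt_ood]  # same response space

    train, in_domain = [], []
    for ex in train_pool:
        resp = ex["responses"][:]
        rng.shuffle(resp)
        # Two responses form the train pair (chosen/rejected would be
        # decided by ArmoRM scores in the paper; omitted here)...
        train.append({"prompt": ex["prompt"], "pair": resp[:2]})
        # ...and the remaining two form the in-domain validation pair,
        # evaluated with binary accuracy.
        in_domain.append({"prompt": ex["prompt"], "pair": resp[2:]})
    return train, in_domain, prompt_ood
```

Response-OOD would reuse the train prompts with responses regenerated by four held-out models, and Mutual-OOD would do the same for the Prompt-OOD prompts.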
Hardware Specification | Yes | We used NVIDIA A100 and A6000 GPUs throughout the experiments.
Software Dependencies | No | For supervised fine-tuning (SFT) and reward modeling, we use Liger-Kernel (Hsu et al., 2024) with DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) and FSDP (Zhao et al., 2023) for efficient training. Including the reinforcement learning from human feedback (RLHF) phase, we utilize the TRL library (von Werra et al., 2020), adjusted to our usage. We used NVIDIA A100 and A6000 GPUs throughout the experiments.
Experiment Setup | Yes | We train every model on UltraChat for a single epoch with a global batch size of 512 following Tunstall et al. (2024). We set a learning rate of 1e-5 with 10% warmup and cosine decay. We train reward models on top of the SFT models above with four different seeds. We fix the global batch size at 128 across models and methods. We set a learning rate of 5e-6 for the Llama-3.2-1B and Qwen2.5-1.5B models, 3e-6 for the 3B models, and 2e-6 for the Llama-3.1-8B and Qwen2.5-7B models. 5% warmup and linear decay were applied following Lambert et al. (2024). We use FSDP for distributed training in reward modeling. We set λ = 1e-3 in Section 5.1 and ablate different λ in Section 5.3.
Table 4. RLOO training configuration details for each section. We train Qwen2.5-1.5B with SFT using the corresponding reward models for each section using 4 A6000 GPUs, excluding the GPUs assigned to the reward model and the vLLM engine for on-policy generation.
Category | Section 3.2 | Section 3.3
Learning Rate | 2e-6 | 1e-6
β | 0.05 | 0.05
Number of responses (k) | 2 | 2
Global Batch (Effective) | 128 | 128
Learning Rate Scheduler | Linear Decay | Linear Decay
Warmup Ratio | 0.03 | 0.03
Training Epochs | 5 | 5
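The warmup-then-linear-decay learning-rate schedule used for reward modeling (5% warmup, linear decay to zero) can be sketched as below. The function name is ours, and the peak learning rate shown (5e-6, the value reported for the 1B-1.5B models) is just the default of the sketch.

```python
def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.05):
    """Linear warmup to peak_lr over warmup_ratio of training,
    then linear decay to zero over the remaining steps."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup: ramp linearly from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: fall linearly from peak_lr to 0 at the final step.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step - 1) / remaining)
```

For example, with 100 total steps the peak of 5e-6 is reached at step 4 (end of the 5% warmup) and the rate reaches zero at the last step.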