GRAM: A Generative Foundation Reward Model for Reward Generalization
Authors: Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China; 2 Meituan Inc.; 3 NiuTrans Research, Shenyang, China; 4 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China. Correspondence to: Chenglong Wang <EMAIL>, Tong Xiao <EMAIL>. |
| Pseudocode | No | The paper includes figures illustrating architectures (Figure 1) and two-stage training (Figure 3), and mathematical equations for loss functions, but no explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our codebase can be found at https://github.com/NiuTrans/GRAM. |
| Open Datasets | Yes | We initialized our GRAM model with the LLaMA-3.1-8B-Instruct and LLaMA-3.2-3B-Instruct models, using a subset of 400k samples from Unified-Feedback for each. The test set was AlpacaEval 2 (Li et al., 2023). Additionally, we trained an oracle reward model using preference data from AlpacaFarm (Dubois et al., 2023)... For each task, we vary the amount of summarization data across {0k, 1k, 3k, 5k, 7k, 10k}, derived from preference data labeled by Stiennon et al. (2020) and Bai et al. (2022), respectively. |
| Dataset Splits | Yes | We trained both types of models on subsets of 400k and 40k samples from the Unified-Feedback dataset. We then evaluated these models on an in-distribution (ID) test set, consisting of 1k test samples from Unified-Feedback, and an out-of-distribution (OOD) test set, consisting of 3k samples from RewardBench (Lambert et al., 2024). We used the data splits provided by AlpacaFarm (Dubois et al., 2023) in performing SFT and PPO fine-tuning. |
| Hardware Specification | Yes | All of our experiments were done on eight A800 GPUs. |
| Software Dependencies | No | The paper mentions using 'LLaMA-3.1-8B-Instruct' and 'LLaMA-3.2-3B-Instruct' as initial models and the 'trlx implementation' for PPO, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The learning rates were set to 2e-5 for the first stage and 1e-5 for the second stage, with training conducted over one epoch in each stage. In the second stage, the label smoothing parameter was set to 0.1... During SFT training, we set the learning rate, batch size, and number of training epochs to 1e-5, 256, and 2, respectively. We trained the discriminative and generative reward model baselines for one epoch with a learning rate of 1e-5 and a batch size of 256. In the Label Smoothing baseline for the generative reward model, we set the smoothing factor to 0.1... Best-of-n Sampling... setting p to 0.95 and the temperature to 0.75. For all experiments, the learning rate was set to 1e-5 for the policy model and 5e-6 for the value model. We settled on a batch size of 64 for each PPO step, which consisted of 1 epoch of gradient steps and 4 epochs of mini-batch PPO steps. When using GRAM to compute reward scores, the optimization objective is defined as: L_PPO = E_{x∼D_PPO, ŷ∼π_θ} [γ · r_φ(x, ŷ)] − α · D_KL[π_θ(ŷ\|x) ∥ π_θ_ref(ŷ\|x)] (Eq. 17), where γ denotes a scaling factor... Here, we set γ to 10. Additionally, to address the over-optimization issue... we evaluated checkpoints at intervals of 200 steps for all tasks against their respective validation sets and selected the checkpoint with the best reward score. Following Wang et al. (2024b), we also employed a cold-start trick for PPO to mitigate the damage caused by inaccurate estimates from the early value model: we updated only the value model, not the policy model, during the first 30 steps of PPO training. Following Wang et al. (2024c), we also standardized reward scores using a reward queue that stored the previous 1k reward scores to compute the mean and variance. |
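The PPO-specific details in the last row (the scaled reward in Eq. 17 and the 1k-score standardization queue) can be sketched as follows. Only γ = 10 and the 1k-score queue are stated in the paper; `alpha`, the function and class names, and the use of the log-probability difference as a single-sample KL estimate are illustrative assumptions, not the authors' implementation.

```python
from collections import deque
import math


def shaped_reward(r, logp_policy, logp_ref, gamma=10.0, alpha=0.1):
    """Per-sample PPO training signal of the form gamma * r - alpha * KL.

    gamma = 10 follows the paper; alpha is a placeholder value, since the
    report excerpt does not state the KL coefficient. The log-prob
    difference is the usual single-sample estimate of the KL term.
    """
    kl_estimate = logp_policy - logp_ref
    return gamma * r - alpha * kl_estimate


class RewardQueue:
    """Standardize reward scores using a queue of recent scores.

    Sketch of the reward-queue trick the report attributes to
    Wang et al. (2024c): keep the previous 1k reward scores and use
    their mean and variance to standardize each incoming score.
    """

    def __init__(self, maxlen=1000, eps=1e-8):
        self.queue = deque(maxlen=maxlen)  # drops oldest scores past maxlen
        self.eps = eps  # avoids division by zero when variance is 0

    def __call__(self, score):
        self.queue.append(score)
        mean = sum(self.queue) / len(self.queue)
        var = sum((s - mean) ** 2 for s in self.queue) / len(self.queue)
        return (score - mean) / math.sqrt(var + self.eps)
```

In an actual PPO loop, each raw GRAM score would pass through the queue before being scaled and combined with the KL penalty; the queue makes the reward scale stable across training even as the policy's outputs drift.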