RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that RM-BENCH strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-BENCH. |
| Researcher Affiliation | Academia | ¹Fudan University, ²Tsinghua University, ³Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes methodologies and processes (e.g., RM-BENCH construction, metrics calculation) in natural language and using equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench. |
| Open Datasets | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench. |
| Dataset Splits | Yes | For each prompt x, we compare the chosen and rejected responses across three style levels: concise y, detailed y^L, and detailed with Markdown formatting y^{L,M}. This allows us to evaluate reward models' ability to distinguish between chosen and rejected responses independently of stylistic differences. |
| Hardware Specification | No | The paper mentions using gpt-4o for response generation but does not specify any hardware used for running the experiments or evaluating the reward models. |
| Software Dependencies | No | The paper mentions various language models and frameworks (e.g., PPO, DPO, gpt-4o, Llama-3.1-8B, Nemotron-340B-Reward) but does not provide specific version numbers for any software dependencies used in their experimental setup. |
| Experiment Setup | Yes | Specifically, we first fine-tuned LLaMA-3-8B using the Tulu-v2 dataset to create the SFT model, followed by PPO training with the Ultrafeedback dataset. For PPO, we used AdamW with a learning rate of 1e-6, a batch size of 64, and a linear warmup scheduler for 10% of the total steps. |
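The style-controlled comparison quoted above (chosen vs. rejected responses at three style levels) can be sketched as a small accuracy matrix. This is a minimal illustration, not the paper's implementation: the array shapes, function names, and the easy/normal/hard aggregation over the lower triangle, diagonal, and upper triangle are assumptions based on the description, with style level 0 = concise, 1 = detailed, 2 = detailed + Markdown.

```python
import numpy as np


def style_accuracy_matrix(r_chosen, r_rejected):
    """Build a 3x3 style-controlled accuracy matrix (illustrative sketch).

    r_chosen, r_rejected: arrays of shape (n_prompts, 3) holding a reward
    model's scores for each response at the three style levels.
    M[i, j] = fraction of prompts where the chosen response rendered at
    style i scores higher than the rejected response rendered at style j.
    """
    r_chosen = np.asarray(r_chosen, dtype=float)
    r_rejected = np.asarray(r_rejected, dtype=float)
    M = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            M[i, j] = np.mean(r_chosen[:, i] > r_rejected[:, j])
    return M


def summary_accuracies(M):
    """Aggregate the matrix into three averages (assumed convention):
    easy   = lower-left triangle (chosen has the fancier style),
    normal = diagonal (matched styles),
    hard   = upper-right triangle (rejected has the fancier style)."""
    easy = [M[i, j] for i in range(3) for j in range(3) if i > j]
    normal = [M[i, i] for i in range(3)]
    hard = [M[i, j] for i in range(3) for j in range(3) if i < j]
    return float(np.mean(easy)), float(np.mean(normal)), float(np.mean(hard))
```

A reward model that is robust to style would score well even on the hard entries, where the rejected response carries the more elaborate formatting.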