RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that RM-BENCH strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-BENCH.
Researcher Affiliation | Academia | ¹Fudan University, ²Tsinghua University, ³Hong Kong University of Science and Technology
Pseudocode | No | The paper describes methodologies and processes (e.g., RM-BENCH construction, metrics calculation) in natural language and in equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench.
Open Datasets | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench.
Dataset Splits | Yes | For each prompt x, we compare the chosen and rejected responses across three style levels: concise y, detailed y^L, and detailed with Markdown formatting y^{L,M}. This allows us to evaluate reward models' ability to distinguish between chosen and rejected responses independently of stylistic differences.
Hardware Specification | No | The paper mentions using gpt-4o for response generation but does not specify any hardware used to run the experiments or evaluate the reward models.
Software Dependencies | No | The paper mentions various language models and frameworks (e.g., PPO, DPO, gpt-4o, Llama-3.1-8B, Nemotron-340B-Reward) but does not provide version numbers for the software dependencies used in its experimental setup.
Experiment Setup | Yes | Specifically, we first fine-tuned LLaMA-3-8B using the Tulu-v2 dataset to create the SFT model, followed by PPO training with the UltraFeedback dataset. For PPO, we used AdamW with a learning rate of 1e-6, a batch size of 64, and a linear warmup scheduler for 10% of the total steps.
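The "Dataset Splits" quote refers to RM-Bench's style-controlled evaluation: each chosen/rejected pair exists in three style variants (concise y, detailed y^L, Markdown-formatted y^{L,M}), and a reward model is judged on all 3×3 cross-style comparisons. The sketch below is a minimal illustration of how such an accuracy matrix could be computed; the function names, the toy reward model, and the easy/normal/hard triangle split are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of a style-controlled comparison matrix; all names
# are hypothetical, not taken from the RM-Bench implementation.
STYLES = ["concise", "detailed", "detailed_markdown"]  # y, y^L, y^{L,M}

def style_accuracy_matrix(reward, prompt, chosen, rejected):
    """M[i][j] = 1 if the chosen response in style i outscores the
    rejected response in style j, else 0."""
    return [[int(reward(prompt, chosen[si]) > reward(prompt, rejected[sj]))
             for sj in STYLES]
            for si in STYLES]

def summarize(matrix):
    """Assumed reading of the metric: 'hard' pits a plainer chosen response
    against a fancier rejected one (upper triangle), 'normal' matches styles
    (diagonal), 'easy' is the reverse (lower triangle)."""
    n = len(matrix)
    hard = [matrix[i][j] for i in range(n) for j in range(n) if i < j]
    normal = [matrix[i][i] for i in range(n)]
    easy = [matrix[i][j] for i in range(n) for j in range(n) if i > j]
    return {"easy": sum(easy) / len(easy),
            "normal": sum(normal) / len(normal),
            "hard": sum(hard) / len(hard)}

# Toy reward model biased toward length and Markdown markers.
toy_reward = lambda p, r: len(r) + 10 * r.count("*")
chosen = {"concise": "Paris.",
          "detailed": "The capital of France is Paris.",
          "detailed_markdown": "The capital of France is **Paris**."}
rejected = {"concise": "Lyon.",
            "detailed": "The capital of France is Lyon.",
            "detailed_markdown": "The capital of France is **Lyon**."}
m = style_accuracy_matrix(toy_reward, "What is the capital of France?",
                          chosen, rejected)
print(summarize(m))  # style-biased model aces easy/normal but fails hard
```

The toy model ranks any styled response above a plain one, so it scores perfectly on easy and normal comparisons yet collapses on the hard ones, which is exactly the failure mode the style-controlled design is meant to expose.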