RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that RM-BENCH strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-BENCH. |
| Researcher Affiliation | Academia | ¹Fudan University, ²Tsinghua University, ³Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes methodologies and processes (e.g., RM-BENCH construction, metrics calculation) in natural language and using equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench. |
| Open Datasets | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench. |
| Dataset Splits | Yes | For each prompt x, we compare the chosen and rejected responses across three style levels: concise y, detailed y^L, and detailed with Markdown formatting y^{L,M}. This allows us to evaluate reward models' ability to distinguish between chosen and rejected responses independently of stylistic differences. |
| Hardware Specification | No | The paper mentions using gpt-4o for response generation but does not specify any hardware used for running the experiments or evaluating the reward models. |
| Software Dependencies | No | The paper mentions various language models and frameworks (e.g., PPO, DPO, gpt-4o, Llama-3.1-8B, Nemotron-340B-Reward) but does not provide specific version numbers for any software dependencies used in their experimental setup. |
| Experiment Setup | Yes | Specifically, we first fine-tuned LLaMA-3-8B using the Tulu-v2 dataset to create the SFT model, followed by PPO training with the Ultrafeedback dataset. For PPO, we used AdamW with a learning rate of 1e-6, a batch size of 64, and a linear warmup scheduler for 10% of the total steps. |
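The style-controlled comparison quoted above (chosen vs. rejected responses at three style levels) can be sketched as a small accuracy matrix. This is a minimal illustration, not the paper's implementation: the array shapes, function names, and the easy/normal/hard aggregation over the lower triangle, diagonal, and upper triangle are assumptions based on the description, with style level 0 = concise, 1 = detailed, 2 = detailed + Markdown.

```python
import numpy as np


def style_accuracy_matrix(r_chosen, r_rejected):
    """Build a 3x3 style-controlled accuracy matrix (illustrative sketch).

    r_chosen, r_rejected: arrays of shape (n_prompts, 3) holding a reward
    model's scores for each response at the three style levels.
    M[i, j] = fraction of prompts where the chosen response rendered at
    style i scores higher than the rejected response rendered at style j.
    """
    r_chosen = np.asarray(r_chosen, dtype=float)
    r_rejected = np.asarray(r_rejected, dtype=float)
    M = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            M[i, j] = np.mean(r_chosen[:, i] > r_rejected[:, j])
    return M


def summary_accuracies(M):
    """Aggregate the matrix into three averages (assumed convention):
    easy   = lower-left triangle (chosen has the fancier style),
    normal = diagonal (matched styles),
    hard   = upper-right triangle (rejected has the fancier style)."""
    easy = [M[i, j] for i in range(3) for j in range(3) if i > j]
    normal = [M[i, i] for i in range(3)]
    hard = [M[i, j] for i in range(3) for j in range(3) if i < j]
    return float(np.mean(easy)), float(np.mean(normal)), float(np.mean(hard))
```

A reward model that is robust to style would score well even on the hard entries, where the rejected response carries the more elaborate formatting.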