Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment

Authors: Yunyi Shen, Hao Sun, Jean-Francois Ton

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared with other selection methods from the deep learning and classical statistics literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF.
Researcher Affiliation | Collaboration | 1EECS, MIT, Cambridge, MA, USA; 2University of Cambridge, Cambridge, UK; 3ByteDance Research, London, UK. Correspondence to: Yunyi Shen <EMAIL>, Hao Sun <EMAIL>, Jean-Francois Ton <EMAIL>.
Pseudocode | Yes | Algorithm 1: Model-based active learning
Open Source Code | Yes | Code and embeddings to reproduce all results of this paper are available at https://github.com/YunyiShen/ARM-FI/. Embeddings for reward modeling and efficient Best-of-N testing are available at https://github.com/holarissun/embedding-based-llm-alignment.
Open Datasets | Yes | We used the Anthropic Harmless and Helpful datasets (Bai et al., 2022a), which have been widely studied in reward modeling and for which golden reward models are available (Yang et al., 2024; Dong et al., 2023; 2024).
Dataset Splits | Yes | The dataset includes 40k prompts with 10 responses each for training, and 2k prompts with 500 generations each for testing. ... At the beginning of each step, we randomly draw 500 prompts.
Hardware Specification | No | The paper mentions 're-training a 3-layer MLP with 10k annotations takes only a few minutes on a GPU server' and 'YS acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper.' These descriptions are too vague and do not name specific hardware models (e.g., particular GPU or CPU models).
Software Dependencies | No | The paper does not explicitly state software dependencies with version numbers, such as the Python version or specific library versions (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | Reward Modeling. To separate representation learning from reward modeling, we train our reward model using joint embeddings of prompts and responses. An MLP with three hidden layers and BT loss was used. ... We test different annotation batch sizes, an important hyperparameter to tune, ranging over 125, 250, 500, and 1000 to understand performance across various settings.
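The reward-modeling setup above (an MLP over joint prompt-response embeddings, trained with the Bradley-Terry loss) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the embedding dimension, hidden width, and all variable names are placeholders, and the paper's actual training code lives in the linked repositories.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, layers):
    """Forward pass through an MLP with ReLU hidden layers and a scalar reward output."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden layers
    W, b = layers[-1]
    return (h @ W + b).squeeze(-1)      # one scalar reward per example

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    d = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-d))))  # log1p for numerical stability

# Illustrative sizes (not the paper's): joint prompt+response embedding dim and hidden width.
dim, hidden = 16, 32
def init_layer(n_in, n_out):
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# Three hidden layers plus a scalar output head, matching the description above.
layers = [init_layer(dim, hidden), init_layer(hidden, hidden),
          init_layer(hidden, hidden), init_layer(hidden, 1)]

# Toy preference pairs: embeddings of the chosen and rejected responses to the same prompts.
emb_chosen = rng.normal(size=(8, dim))
emb_rejected = rng.normal(size=(8, dim))
loss = bt_loss(mlp_forward(emb_chosen, layers), mlp_forward(emb_rejected, layers))
print(round(loss, 4))  # near log(2) ~ 0.693 at a small random initialization
```

Minimizing this loss (e.g., with gradient descent in an autodiff framework) pushes the reward of the chosen response above that of the rejected one; at initialization, where both rewards are near zero, the loss sits near log 2.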