Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment

Authors: Yunyi Shen, Hao Sun, Jean-Francois Ton

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared with other selection methods from the deep learning and classical statistics literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF.
Researcher Affiliation | Collaboration | 1EECS, MIT, Cambridge, MA, USA; 2University of Cambridge, Cambridge, UK; 3ByteDance Research, London, UK. Correspondence to: Yunyi Shen <EMAIL>, Hao Sun <EMAIL>, Jean-Francois Ton <EMAIL>.
Pseudocode | Yes | Algorithm 1: Model-based active learning
Open Source Code | Yes | Code and embeddings to reproduce all results of this paper are available at https://github.com/YunyiShen/ARM-FI/. Embeddings for reward modeling and efficient Best-of-N testing are available at https://github.com/holarissun/embedding-based-llm-alignment.
Open Datasets | Yes | We used the Anthropic Harmless and Helpful datasets (Bai et al., 2022a), which have been widely studied in reward modeling and for which golden reward models are available (Yang et al., 2024; Dong et al., 2023; 2024).
Dataset Splits | Yes | The dataset includes 40k prompts with 10 responses each for training, and 2k prompts with 500 generations each for testing. ... At the beginning of each step, we randomly draw 500 prompts.
Hardware Specification | No | The paper mentions 're-training a 3-layer MLP with 10k annotations takes only a few minutes on a GPU server' and 'YS acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper.' These descriptions are too vague and do not name specific hardware models (e.g., particular GPU or CPU models).
Software Dependencies | No | The paper does not explicitly state software dependencies with version numbers, such as the Python version or specific library versions (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | Reward Modeling. To separate representation learning from reward modeling, we train our reward model using joint embeddings of prompts and responses. An MLP with three hidden layers and BT loss was used. ... We test different annotation batch sizes, an important hyperparameter to tune, ranging over 125, 250, 500, and 1000 to understand performance across various settings.
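The reward-modeling setup above (an MLP over joint prompt-response embeddings, trained with the Bradley-Terry loss) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the embedding dimension, hidden width, and all variable names are placeholders, and the paper's actual training code lives in the linked repositories.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, layers):
    """Forward pass through an MLP with ReLU hidden layers and a scalar reward output."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden layers
    W, b = layers[-1]
    return (h @ W + b).squeeze(-1)      # one scalar reward per example

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    d = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-d))))  # log1p for numerical stability

# Illustrative sizes (not the paper's): joint prompt+response embedding dim and hidden width.
dim, hidden = 16, 32
def init_layer(n_in, n_out):
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# Three hidden layers plus a scalar output head, matching the description above.
layers = [init_layer(dim, hidden), init_layer(hidden, hidden),
          init_layer(hidden, hidden), init_layer(hidden, 1)]

# Toy preference pairs: embeddings of the chosen and rejected responses to the same prompts.
emb_chosen = rng.normal(size=(8, dim))
emb_rejected = rng.normal(size=(8, dim))
loss = bt_loss(mlp_forward(emb_chosen, layers), mlp_forward(emb_rejected, layers))
print(round(loss, 4))  # near log(2) ~ 0.693 at a small random initialization
```

Minimizing this loss (e.g., with gradient descent in an autodiff framework) pushes the reward of the chosen response above that of the rejected one; at initialization, where both rewards are near zero, the loss sits near log 2.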