Rethinking Reward Modeling in Preference-based Large Language Model Alignment

Authors: Hao Sun, Yunyi Shen, Jean-Francois Ton

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we conduct extensive experiments covering 6 base LLMs, 2 datasets, 3 response sampling methods, 6 annotation noise levels, 3 reward model implementations, 4 annotation availability scenarios, and 5 random seeds, resulting in over 12,000 runs.
Researcher Affiliation | Collaboration | Hao Sun, Yunyi Shen, Jean-Francois Ton. University of Cambridge, Massachusetts Institute of Technology, ByteDance Research. EMAIL, EMAIL, EMAIL.
Pseudocode | No | The paper describes methods in regular paragraph text and equations but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | To enhance the reproducibility of our work, all code, datasets (demonstrations), fine-tuned LLMs, generated training and test responses, annotations of those responses, and their embeddings will be made publicly available.
Open Datasets | Yes | Datasets. We used the Anthropic-Harmless and Anthropic-Helpful datasets (Bai et al., 2022a), as these are extensively studied in the context of reward modeling, and open-source golden reward models are available (Yang et al., 2024b; Dong et al., 2023; 2024).
Dataset Splits | Yes | The Harmless dataset contains 41,876 training prompts and 2,273 test prompts. The Helpful dataset contains 42,846 training prompts and 2,292 test prompts.
Hardware Specification | Yes | Our experiments are conducted on a cluster with 128 Intel(R) Xeon(R) Platinum 8336C CPUs @ 2.30GHz and NVIDIA V100 32GB or NVIDIA A100 80GB GPU nodes.
Software Dependencies | No | We use vllm (Kwon et al., 2023) to accelerate the LLM generation process. The SFT takes less than 10 hours (4 hours for the 2b models) using A100 GPUs and the TRL framework (von Werra et al., 2020). For LGB models, we use the default hyper-parameter setting of .... The paper mentions software but does not specify version numbers.
Experiment Setup | Yes | For MLPs, we use a minimalist three-layer feed-forward structure: hyper-param-mlp = { activation: ReLU, units: (1024, 512, 1), loss: BCELoss, optimizer: Adam, lr: 0.001, early_stop_patience: 3, max_epoch: 30 }
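Read as a PyTorch module, the quoted MLP hyper-parameters (ReLU, units (1024, 512, 1), BCE loss, Adam at lr 0.001) might correspond to a sketch like the one below. This is an illustrative reconstruction, not the paper's released code: the input embedding dimension, the dummy data, and the Bradley-Terry-style pairing of chosen/rejected responses are assumptions.

```python
import torch
import torch.nn as nn

class MLPRewardModel(nn.Module):
    """Three-layer feed-forward reward head: units (1024, 512, 1) with ReLU,
    per the quoted hyper-parameters. embed_dim is an assumption."""
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Map a response embedding to a scalar reward.
        return self.net(emb).squeeze(-1)

# One pairwise preference-training step: BCE on the reward margin
# (Bradley-Terry style), using the quoted loss/optimizer/lr settings.
model = MLPRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()

chosen = torch.randn(8, 1024)    # embeddings of preferred responses (dummy data)
rejected = torch.randn(8, 1024)  # embeddings of rejected responses (dummy data)

margin = model(chosen) - model(rejected)         # reward difference
loss = loss_fn(margin, torch.ones_like(margin))  # the chosen response should win
loss.backward()
optimizer.step()
```

Early stopping (patience 3, max 30 epochs) would wrap this step in a validation loop; it is omitted here for brevity.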