Rethinking Reward Modeling in Preference-based Large Language Model Alignment
Authors: Hao Sun, Yunyi Shen, Jean-Francois Ton
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct extensive experiments covering 6 base LLMs, 2 datasets, 3 response sampling methods, 6 annotation noise levels, 3 reward model implementations, 4 annotation availability scenarios, and 5 random seeds resulting in over 12,000 runs. |
| Researcher Affiliation | Collaboration | Hao Sun, Yunyi Shen, Jean-Francois Ton — University of Cambridge, Massachusetts Institute of Technology, ByteDance Research. EMAIL, EMAIL, EMAIL. |
| Pseudocode | No | The paper describes methods in regular paragraph text and equations but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | To enhance the reproducibility of our work, all code, datasets (demonstrations), fine-tuned LLMs, generated training and test responses, annotations of those responses, and their embeddings will be made publicly available. |
| Open Datasets | Yes | Datasets. We used the Anthropic-Harmless and Anthropic-Helpful datasets (Bai et al., 2022a), as these are extensively studied in the context of reward modeling, and open-source golden reward models are available (Yang et al., 2024b; Dong et al., 2023; 2024). |
| Dataset Splits | Yes | The Harmless dataset contains 41876 training prompts and 2273 test prompts. The Helpful dataset contains 42846 training prompts, and 2292 test prompts. |
| Hardware Specification | Yes | Our experiments are conducted on a cluster having 128 Intel(R) Xeon(R) Platinum 8336C CPUs @2.30GHz with NVIDIA V100 32GB or NVIDIA A100 80G GPU nodes. |
| Software Dependencies | No | We use vllm (Kwon et al., 2023) to accelerate the LLM generation process. The SFT takes less than 10 hours (4 hours for the 2b models) using A100 GPUs and the TRL framework (von Werra et al., 2020). For LGB models, we use the default hyper-parameter setting of .... The paper mentions software but does not specify their version numbers. |
| Experiment Setup | Yes | For MLPs, we use a minimalist three-layer feed-forward structure of hyper-param-mlp = { activation: ReLU, units: (1024, 512, 1), loss: BCELoss, optimizer: Adam, lr: 0.001, early_stop_patience: 3, max_epoch: 30 } |
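The MLP configuration quoted in the last row can be sketched as a small PyTorch model. This is a hedged illustration, not the authors' released code: the input dimension `embed_dim` is an assumption (the paper trains reward models on LLM response embeddings, whose size depends on the base model), and `RewardMLP` is a hypothetical name.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three-layer feed-forward reward model described in
# the table: units (1024, 512, 1), ReLU activations, BCELoss, Adam at lr=0.001.
# `embed_dim` is an assumed input size; the actual embedding dimension
# depends on the base LLM used in the paper.
class RewardMLP(nn.Module):
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.Sigmoid(),  # BCELoss expects probabilities in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = RewardMLP(embed_dim=1024)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One illustrative training step on random data
x = torch.randn(8, 1024)                      # batch of 8 embedding vectors
labels = torch.randint(0, 2, (8, 1)).float()  # binary preference labels
probs = model(x)
loss = criterion(probs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The early-stopping patience of 3 and the 30-epoch cap from the quoted config would sit in the surrounding training loop, which is omitted here.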