Rethinking Reward Modeling in Preference-based Large Language Model Alignment

Authors: Hao Sun, Yunyi Shen, Jean-Francois Ton

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we conduct extensive experiments covering 6 base LLMs, 2 datasets, 3 response sampling methods, 6 annotation noise levels, 3 reward model implementations, 4 annotation availability scenarios, and 5 random seeds, resulting in over 12,000 runs.
Researcher Affiliation | Collaboration | Hao Sun, Yunyi Shen, Jean-Francois Ton. University of Cambridge, Massachusetts Institute of Technology, ByteDance Research. EMAIL, EMAIL, EMAIL.
Pseudocode | No | The paper describes methods in regular paragraph text and equations but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | To enhance the reproducibility of our work, all code, datasets (demonstrations), fine-tuned LLMs, generated training and test responses, annotations of those responses, and their embeddings will be made publicly available.
Open Datasets | Yes | Datasets. We used the Anthropic-Harmless and Anthropic-Helpful datasets (Bai et al., 2022a), as these are extensively studied in the context of reward modeling, and open-source golden reward models are available (Yang et al., 2024b; Dong et al., 2023; 2024).
Dataset Splits | Yes | The Harmless dataset contains 41,876 training prompts and 2,273 test prompts. The Helpful dataset contains 42,846 training prompts and 2,292 test prompts.
Hardware Specification | Yes | Our experiments are conducted on a cluster with 128 Intel(R) Xeon(R) Platinum 8336C CPUs @ 2.30GHz and NVIDIA V100 32GB or NVIDIA A100 80GB GPU nodes.
Software Dependencies | No | We use vllm (Kwon et al., 2023) to accelerate the LLM generation process. The SFT takes less than 10 hours (4 hours for the 2b models) using A100 GPUs and the TRL framework (von Werra et al., 2020). For LGB models, we use the default hyper-parameter setting of .... The paper mentions software but does not specify version numbers.
Experiment Setup | Yes | For MLPs, we use a minimalist three-layer feed-forward structure: hyper-param-mlp = { activation: ReLU, units: (1024, 512, 1), loss: BCELoss, optimizer: Adam, lr: 0.001, early_stop_patience: 3, max_epoch: 30 }
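Read as a PyTorch module, the quoted MLP hyper-parameters (ReLU, units (1024, 512, 1), BCE loss, Adam at lr 0.001) might correspond to a sketch like the one below. This is an illustrative reconstruction, not the paper's released code: the input embedding dimension, the dummy data, and the Bradley-Terry-style pairing of chosen/rejected responses are assumptions.

```python
import torch
import torch.nn as nn

class MLPRewardModel(nn.Module):
    """Three-layer feed-forward reward head: units (1024, 512, 1) with ReLU,
    per the quoted hyper-parameters. embed_dim is an assumption."""
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Map a response embedding to a scalar reward.
        return self.net(emb).squeeze(-1)

# One pairwise preference-training step: BCE on the reward margin
# (Bradley-Terry style), using the quoted loss/optimizer/lr settings.
model = MLPRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()

chosen = torch.randn(8, 1024)    # embeddings of preferred responses (dummy data)
rejected = torch.randn(8, 1024)  # embeddings of rejected responses (dummy data)

margin = model(chosen) - model(rejected)         # reward difference
loss = loss_fn(margin, torch.ones_like(margin))  # the chosen response should win
loss.backward()
optimizer.step()
```

Early stopping (patience 3, max 30 epochs) would wrap this step in a validation loop; it is omitted here for brevity.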