PILAF: Optimal Human Preference Sampling for Reward Modeling
Authors: Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to validate PILAF's effectiveness and robustness. As a stand-in for expensive human annotators, we use a well-trained reward model, Skywork Llama-3.1-8B (Liu et al., 2024a), as a proxy for the oracle reward. Throughout training, we query this model exclusively for preference labels, simulating human feedback. We then align the Llama-3.1-8B base model (Dubey et al., 2024) using these proxy-labeled preference data in two settings: iterative DPO (Xiong et al., 2024) and online DPO (Guo et al., 2024). |
| Researcher Affiliation | Collaboration | Yunzhen Feng 1 Ariel Kwiatkowski 2 * Kunhao Zheng 2 * Julia Kempe 2 1 Yaqi Duan 1 1 New York University 2 Meta FAIR |
| Pseudocode | Yes | We formalize our final algorithm in Algorithm 1. Algorithm 1: DPO with PILAF (ours). Input: prompt dataset D_ρ, preference oracle O, π_θ, π_ref. 1: for step t = 1, ..., T do; 2: sample n_t prompts {x_i}_{i=1}^{n_t} from D_ρ; 3: with probability 1/2, sample y_i^a, y_i^b ~ π_θ; with probability 1/2, sample y_i^a ~ π_θ^+ and y_i^b ~ π_θ^-; 4: query O to label (x_i, y_i^a, y_i^b) into (x_i, y_i^w, y_i^ℓ); 5: update π_{θ_t} with the DPO loss using {(x_i, y_i^w, y_i^ℓ)}_{i=1}^{n_t}; 6: end for. |
| Open Source Code | No | We implement our code based on the open-sourced Open RLHF framework Hu et al. (2024). We will open-source our code in the camera-ready version. |
| Open Datasets | Yes | We align the Llama-3.1-8B base model (Dubey et al., 2024) in terms of helpfulness and harmlessness using the HH-RLHF dataset (Bai et al., 2022), a widely-used benchmark dataset for alignment. |
| Dataset Splits | Yes | It consists of 161k prompts in the training set. For response preference labeling, we use a well-trained reward model to simulate human preferences by assigning preference to pairs of responses under the BT assumption in Equation (1). Specifically, we employ the Skywork-Reward-8B model (Liu et al., 2024a), a top-performing 8B model on Reward Bench (Lambert et al., 2024), as our oracle O. During training, interaction with this reward model is limited to providing two responses for comparison. We set β = 0.1 in all the experiments. ... Evaluation. We present our results using the reward-KL curve, following Gao et al. (2023), with the reward evaluated by the oracle reward model O. To monitor the impact of our sampling scheme on the optimization trajectory, we evaluate the model every 50 gradient steps during training. We use the entire test set of HH-RLHF (8.55K samples) for evaluation. |
| Hardware Specification | No | Due to resource constraints, our evaluations were conducted using 8B models and a reward model to simulate human feedback. ... We implement our code based on the open-sourced Open RLHF framework Hu et al. (2024). ... The paper mentions "resource constraints" and performing evaluations with "8B models" but does not specify any particular GPU models, CPU types, or other hardware components used for running the experiments. |
| Software Dependencies | No | We implement our code based on the open-sourced Open RLHF framework Hu et al. (2024). We will open-source our code in the camera-ready version. ... The paper mentions the "Open RLHF framework" and specific LLMs used (Llama-3.1-8B, Skywork-Reward-8B), but it does not provide version numbers for any key software components like programming languages, libraries (e.g., PyTorch, TensorFlow), or other specific solvers. |
| Experiment Setup | Yes | For SFT, we apply full-parameter tuning with Adam for one epoch, using a cosine learning rate schedule, a 3% warmup phase, a learning rate of 5 × 10^-7, and a batch size of 256. These hyperparameters are adopted from Hu et al. (2024). For all the DPO training in both iterative and online settings, we use full-parameter tuning with Adam but with two epochs. The learning rate, warmup schedule, and batch size are all the same. ... We set β = 0.1 in all the experiments. ... During generation, we limit the maximum number of new tokens to 896 and employ top-p decoding with p = 0.95 for all experiments. For Online DPO, we use a sampling temperature of 1.0, following Guo et al. (2024), while in Iterative DPO, we set the temperature to 0.7 to account for the off-policy nature of the data, following Dong et al. (2024); Shi et al. (2024). |
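The data-collection loop of Algorithm 1 (steps 2–5, minus the DPO gradient update) can be sketched in plain Python. This is a minimal illustration, not the authors' released code: `sample_policy`, `sample_shifted`, and `oracle_reward` are hypothetical stand-ins for sampling from π_θ, from the reward-tilted policies π_θ^±, and for querying the proxy oracle O.

```python
import random

def pilaf_sample_pair(prompt, sample_policy, sample_shifted):
    """One PILAF draw (Algorithm 1, step 3): with probability 1/2 sample
    both responses from the current policy pi_theta; otherwise sample one
    response from the tilted policy pi_theta^+ and one from pi_theta^-.
    The samplers here are hypothetical placeholders."""
    if random.random() < 0.5:
        return sample_policy(prompt), sample_policy(prompt)
    return sample_shifted(prompt, +1), sample_shifted(prompt, -1)

def label_with_oracle(prompt, ya, yb, oracle_reward):
    """Step 4: order the unlabeled pair into (winner, loser) by the
    proxy oracle's reward, standing in for a human annotator."""
    if oracle_reward(prompt, ya) >= oracle_reward(prompt, yb):
        return ya, yb
    return yb, ya

def pilaf_round(prompts, sample_policy, sample_shifted, oracle_reward):
    """Collect one batch of labeled preference pairs; the caller would
    then run a DPO update (step 5) on the returned tuples."""
    batch = []
    for x in prompts:
        ya, yb = pilaf_sample_pair(x, sample_policy, sample_shifted)
        yw, yl = label_with_oracle(x, ya, yb, oracle_reward)
        batch.append((x, yw, yl))
    return batch
```

In the paper's setup the oracle is the Skywork reward model queried only for pairwise labels; here any scalar-scoring function plays that role.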
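The decoding settings reported above (top-p = 0.95; temperature 1.0 for Online DPO, 0.7 for Iterative DPO) can be made concrete with a toy nucleus-sampling filter. This is a generic sketch of the technique, not the authors' implementation; the dict-based token distribution is an illustrative simplification.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the sampling
    temperature first (temperature < 1 sharpens the distribution)."""
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def top_p_filter(probs, p=0.95):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize over that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept}
```

Lowering the temperature to 0.7, as done for Iterative DPO, concentrates mass on high-logit tokens before the top-p cutoff is applied.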