PILAF: Optimal Human Preference Sampling for Reward Modeling
Authors: Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to validate PILAF's effectiveness and robustness. As a stand-in for expensive human annotators, we use a well-trained reward model, Skywork Llama-3.1-8B (Liu et al., 2024a), as a proxy for the oracle reward. Throughout training, we query this model exclusively for preference labels, simulating human feedback. We then align the Llama-3.1-8B base model (Dubey et al., 2024) using these proxy-labeled preference data in two settings: iterative DPO (Xiong et al., 2024) and online DPO (Guo et al., 2024). |
| Researcher Affiliation | Collaboration | Yunzhen Feng 1 Ariel Kwiatkowski 2 * Kunhao Zheng 2 * Julia Kempe 2 1 Yaqi Duan 1 1 New York University 2 Meta FAIR |
| Pseudocode | Yes | We formalize our final algorithm in Algorithm 1. Algorithm 1: DPO with PILAF (ours). Input: prompt dataset D_ρ, preference oracle O, π_θ, π_ref. 1: for step t = 1, ..., T do; 2: sample n_t prompts {x_i}_{i=1}^{n_t} from D_ρ; 3: with probability 1/2, sample y_i^a, y_i^b ~ π_θ; with probability 1/2, sample y_i^a ~ π_θ^+ and y_i^b ~ π_θ^-; 4: query O to label (x_i, y_i^a, y_i^b) into (x_i, y_i^w, y_i^ℓ); 5: update π_{θ_t} with the DPO loss using {(x_i, y_i^w, y_i^ℓ)}_{i=1}^{n_t}; 6: end for. |
| Open Source Code | No | We implement our code based on the open-sourced Open RLHF framework Hu et al. (2024). We will open-source our code in the camera-ready version. |
| Open Datasets | Yes | We align the Llama-3.1-8B base model (Dubey et al., 2024) in terms of helpfulness and harmlessness using the HH-RLHF dataset (Bai et al., 2022), a widely-used benchmark dataset for alignment. |
| Dataset Splits | Yes | It consists of 161k prompts in the training set. For response preference labeling, we use a well-trained reward model to simulate human preferences by assigning preference to pairs of responses under the BT assumption in Equation (1). Specifically, we employ the Skywork-Reward-8B model (Liu et al., 2024a), a top-performing 8B model on Reward Bench (Lambert et al., 2024), as our oracle O. During training, interaction with this reward model is limited to providing two responses for comparison. We set β = 0.1 in all the experiments. ... Evaluation. We present our results using the reward-KL curve, following Gao et al. (2023), with the reward evaluated by the oracle reward model O. To monitor the impact of our sampling scheme on the optimization trajectory, we evaluate the model every 50 gradient steps during training. We use the entire test set of HH-RLHF (8.55K samples) for evaluation. |
| Hardware Specification | No | Due to resource constraints, our evaluations were conducted using 8B models and a reward model to simulate human feedback. ... We implement our code based on the open-sourced Open RLHF framework Hu et al. (2024). ... The paper mentions "resource constraints" and performing evaluations with "8B models" but does not specify any particular GPU models, CPU types, or other hardware components used for running the experiments. |
| Software Dependencies | No | We implement our code based on the open-sourced Open RLHF framework Hu et al. (2024). We will open-source our code in the camera-ready version. ... The paper mentions the "Open RLHF framework" and specific LLMs used (Llama-3.1-8B, Skywork-Reward-8B), but it does not provide version numbers for any key software components like programming languages, libraries (e.g., PyTorch, TensorFlow), or other specific solvers. |
| Experiment Setup | Yes | For SFT, we apply full-parameter tuning with Adam for one epoch, using a cosine learning rate schedule, a 3% warmup phase, a learning rate of 5 × 10^-7, and a batch size of 256. These hyperparameters are adopted from Hu et al. (2024). For all the DPO training in both iterative and online settings, we use full-parameter tuning with Adam but with two epochs. The learning rate, warmup schedule, and batch size are all the same. ... We set β = 0.1 in all the experiments. ... During generation, we limit the maximum number of new tokens to 896 and employ top-p decoding with p = 0.95 for all experiments. For Online DPO, we use a sampling temperature of 1.0, following Guo et al. (2024), while in Iterative DPO, we set the temperature to 0.7 to account for the off-policy nature of the data, following Dong et al. (2024); Shi et al. (2024). |
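The data-collection loop of Algorithm 1 (steps 2–5, minus the DPO gradient update) can be sketched in plain Python. This is a minimal illustration, not the authors' released code: `sample_policy`, `sample_shifted`, and `oracle_reward` are hypothetical stand-ins for sampling from π_θ, from the reward-tilted policies π_θ^±, and for querying the proxy oracle O.

```python
import random

def pilaf_sample_pair(prompt, sample_policy, sample_shifted):
    """One PILAF draw (Algorithm 1, step 3): with probability 1/2 sample
    both responses from the current policy pi_theta; otherwise sample one
    response from the tilted policy pi_theta^+ and one from pi_theta^-.
    The samplers here are hypothetical placeholders."""
    if random.random() < 0.5:
        return sample_policy(prompt), sample_policy(prompt)
    return sample_shifted(prompt, +1), sample_shifted(prompt, -1)

def label_with_oracle(prompt, ya, yb, oracle_reward):
    """Step 4: order the unlabeled pair into (winner, loser) by the
    proxy oracle's reward, standing in for a human annotator."""
    if oracle_reward(prompt, ya) >= oracle_reward(prompt, yb):
        return ya, yb
    return yb, ya

def pilaf_round(prompts, sample_policy, sample_shifted, oracle_reward):
    """Collect one batch of labeled preference pairs; the caller would
    then run a DPO update (step 5) on the returned tuples."""
    batch = []
    for x in prompts:
        ya, yb = pilaf_sample_pair(x, sample_policy, sample_shifted)
        yw, yl = label_with_oracle(x, ya, yb, oracle_reward)
        batch.append((x, yw, yl))
    return batch
```

In the paper's setup the oracle is the Skywork reward model queried only for pairwise labels; here any scalar-scoring function plays that role.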
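The decoding settings reported above (top-p = 0.95; temperature 1.0 for Online DPO, 0.7 for Iterative DPO) can be made concrete with a toy nucleus-sampling filter. This is a generic sketch of the technique, not the authors' implementation; the dict-based token distribution is an illustrative simplification.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the sampling
    temperature first (temperature < 1 sharpens the distribution)."""
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def top_p_filter(probs, p=0.95):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize over that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept}
```

Lowering the temperature to 0.7, as done for Iterative DPO, concentrates mass on high-logit tokens before the top-p cutoff is applied.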