HPS: Hard Preference Sampling for Human Preference Alignment
Authors: Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation. [...] experimental results demonstrate that HPS outperforms state-of-the-arts (SoTAs) in both fine-tuning and transfer learning settings. |
| Researcher Affiliation | Academia | ¹Singapore Management University, ²The Hong Kong Polytechnic University. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and prose (e.g., Section 4: Methodology), but it does not include a clearly labeled pseudocode block or algorithm section. |
| Open Source Code | Yes | The source code is available at https://github.com/LVLab-SMU/HPS. |
| Open Datasets | Yes | Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness [...] We use two popular datasets, HH-RLHF (Bai et al., 2022a) and PKU-SafeRLHF (Ji et al., 2024b), focusing on helpfulness and safety (Lambert et al., 2024; Fourrier et al., 2024). |
| Dataset Splits | No | The paper mentions using 'HH-RLHF test dataset' and 'PKU-Safety test dataset' for evaluations and a user study. However, it does not explicitly provide the overall train/validation/test splits (e.g., percentages or counts) used for the main model training, nor does it cite a specific standard split configuration it adheres to for all experiments. |
| Hardware Specification | Yes | We utilize 8 x L40-S GPUs for data augmentation and annotation. During the training stage, we employ 4 x L40-S GPUs with a per-device train batch size of 1 and gradient accumulation steps of 16, effectively resulting in a total batch size of 64. |
| Software Dependencies | No | The paper mentions applying LoRA and using the AdamW optimizer, but it does not specify version numbers for any key software components like Python, PyTorch, or CUDA libraries. |
| Experiment Setup | Yes | Due to computational constraints, we apply LoRA (Hu et al., 2021) for efficient fine-tuning with a rank of 8 and scaling factor α = 16. The KL penalty strength β is set to 0.1, following DPO. [...] We fine-tune all methods for 2 epochs using the AdamW optimizer (Loshchilov, 2017) with a learning rate of 5.0 × 10⁻⁷ and a cosine learning rate scheduler. [...] with a sampling temperature of 0.9 during inference. |
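
The hyperparameters quoted above (LoRA rank/α, KL penalty β, learning rate, batch sizing) can be collected into a single configuration sketch. The dictionary keys below are illustrative names, not the authors' actual config schema; the values come from the Hardware Specification and Experiment Setup rows.

```python
# Hedged sketch of the reported training configuration.
# Key names are hypothetical; values are quoted from the paper.
config = {
    "lora_rank": 8,                    # LoRA rank
    "lora_alpha": 16,                  # LoRA scaling factor
    "kl_beta": 0.1,                    # KL penalty strength, following DPO
    "epochs": 2,
    "optimizer": "AdamW",
    "learning_rate": 5.0e-7,
    "lr_scheduler": "cosine",
    "sampling_temperature": 0.9,       # used during inference
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "num_train_gpus": 4,               # 4 x L40-S GPUs for training
}

# Effective total batch size: 1 * 16 * 4 = 64, matching the paper's
# stated "total batch size of 64".
effective_batch_size = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * config["num_train_gpus"]
)
print(effective_batch_size)  # 64
```

This consistency check (per-device batch × accumulation steps × GPUs = 64) confirms the reported numbers are internally coherent.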