HPS: Hard Preference Sampling for Human Preference Alignment

Authors: Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation. [...] experimental results demonstrate that HPS outperforms state-of-the-arts (SoTAs) in both fine-tuning and transfer learning settings.
Researcher Affiliation | Academia | 1 Singapore Management University, 2 The Hong Kong Polytechnic University.
Pseudocode | No | The paper describes the methodology using mathematical equations and prose (e.g., Section 4: Methodology), but it does not include a clearly labeled pseudocode block or algorithm section.
Open Source Code | Yes | The source code is available at https://github.com/LVLab-SMU/HPS.
Open Datasets | Yes | Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness [...] We use two popular datasets, HH-RLHF (Bai et al., 2022a) and PKU-SafeRLHF (Ji et al., 2024b), focusing on helpfulness and safety (Lambert et al., 2024; Fourrier et al., 2024).
Dataset Splits | No | The paper mentions using the 'HH-RLHF test dataset' and 'PKU-Safety test dataset' for evaluations and a user study. However, it does not explicitly provide the overall train/validation/test splits (e.g., percentages or counts) used for the main model training, nor does it cite a specific standard split configuration it adheres to for all experiments.
Hardware Specification | Yes | We utilize 8 x L40-S GPUs for data augmentation and annotation. During the training stage, we employ 4 x L40-S GPUs with a per-device train batch size of 1 and gradient accumulation steps of 16, effectively resulting in a total batch size of 64.
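The quoted "total batch size of 64" follows from the reported configuration by simple arithmetic; a minimal sketch (not from the paper's code, function name is illustrative):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Effective batch size = per-device batch x gradient-accumulation steps x GPU count."""
    return per_device * grad_accum * num_gpus

# Values quoted from the paper: 1 per device, 16 accumulation steps, 4 L40-S GPUs.
print(effective_batch_size(per_device=1, grad_accum=16, num_gpus=4))  # → 64
```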
Software Dependencies | No | The paper mentions applying LoRA and using the AdamW optimizer, but it does not specify version numbers for any key software components such as Python, PyTorch, or CUDA libraries.
Experiment Setup | Yes | Due to computational constraints, we apply LoRA (Hu et al., 2021) for efficient fine-tuning with a rank of 8 and scaling factor α = 16. The KL penalty strength β is set to 0.1, following DPO. [...] We fine-tune all methods for 2 epochs using the AdamW optimizer (Loshchilov, 2017) with a learning rate of 5.0 × 10⁻⁷ and a cosine learning rate scheduler. [...] with a sampling temperature of 0.9 during inference.
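The setup quotes a KL penalty strength β = 0.1 "following DPO". For context, a minimal sketch of the standard DPO objective for a single preference pair (pure Python; the function name and example log-probabilities are illustrative, not from the paper's code):

```python
import math

def dpo_loss(beta: float,
             logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi_theta(y_w) - log pi_ref(y_w))
                       - (log pi_theta(y_l) - log pi_ref(y_l))])."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# beta = 0.1 as in the paper's setup; log-probabilities here are made up.
loss = dpo_loss(beta=0.1, logp_chosen=-5.0, logp_rejected=-6.0,
                ref_logp_chosen=-5.5, ref_logp_rejected=-5.8)
```

With a zero margin the loss is log 2; it decreases as the policy's preference margin over the reference model grows.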