HPS: Hard Preference Sampling for Human Preference Alignment

Authors: Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation. [...] experimental results demonstrate that HPS outperforms state-of-the-arts (SoTAs) in both fine-tuning and transfer learning settings.
Researcher Affiliation | Academia | 1 Singapore Management University, 2 The Hong Kong Polytechnic University.
Pseudocode | No | The paper describes the methodology using mathematical equations and prose (e.g., Section 4: Methodology), but it does not include a clearly labeled pseudocode block or algorithm section.
Open Source Code | Yes | The source code is available at https://github.com/LVLab-SMU/HPS.
Open Datasets | Yes | Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness [...] We use two popular datasets, HH-RLHF (Bai et al., 2022a) and PKU-SafeRLHF (Ji et al., 2024b), focusing on helpfulness and safety (Lambert et al., 2024; Fourrier et al., 2024).
Dataset Splits | No | The paper mentions using the 'HH-RLHF test dataset' and 'PKU-Safety test dataset' for evaluations and a user study. However, it does not explicitly provide the overall train/validation/test splits (e.g., percentages or counts) used for the main model training, nor does it cite a specific standard split configuration it adheres to for all experiments.
Hardware Specification | Yes | We utilize 8 x L40-S GPUs for data augmentation and annotation. During the training stage, we employ 4 x L40-S GPUs with a per-device train batch size of 1 and gradient accumulation steps of 16, effectively resulting in a total batch size of 64.
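The quoted "total batch size of 64" follows from the reported configuration by simple arithmetic; a minimal sketch (not from the paper's code, function name is illustrative):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Effective batch size = per-device batch x gradient-accumulation steps x GPU count."""
    return per_device * grad_accum * num_gpus

# Values quoted from the paper: 1 per device, 16 accumulation steps, 4 L40-S GPUs.
print(effective_batch_size(per_device=1, grad_accum=16, num_gpus=4))  # → 64
```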
Software Dependencies | No | The paper mentions applying LoRA and using the AdamW optimizer, but it does not specify version numbers for any key software components such as Python, PyTorch, or CUDA libraries.
Experiment Setup | Yes | Due to computational constraints, we apply LoRA (Hu et al., 2021) for efficient fine-tuning with a rank of 8 and scaling factor α = 16. The KL penalty strength β is set to 0.1, following DPO. [...] We fine-tune all methods for 2 epochs using the AdamW optimizer (Loshchilov, 2017) with a learning rate of 5.0 × 10⁻⁷ and a cosine learning rate scheduler. [...] with a sampling temperature of 0.9 during inference.
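The setup quotes a KL penalty strength β = 0.1 "following DPO". For context, a minimal sketch of the standard DPO objective for a single preference pair (pure Python; the function name and example log-probabilities are illustrative, not from the paper's code):

```python
import math

def dpo_loss(beta: float,
             logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi_theta(y_w) - log pi_ref(y_w))
                       - (log pi_theta(y_l) - log pi_ref(y_l))])."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# beta = 0.1 as in the paper's setup; log-probabilities here are made up.
loss = dpo_loss(beta=0.1, logp_chosen=-5.0, logp_rejected=-6.0,
                ref_logp_chosen=-5.5, ref_logp_rejected=-5.8)
```

With a zero margin the loss is log 2; it decreases as the policy's preference margin over the reference model grows.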