Policy-labeled Preference Learning: Is Preference Enough for RLHF?

Authors: Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings. For more information, visit our project page: https://jjush.github.io/PPL/. (...) Empirically, to consider the fact that real-world offline data often consists of rollouts from diverse policies, we construct homogeneous and heterogeneous datasets in the Meta World environment and evaluate performance across various offline datasets. (...) 4. Experiments (...) Table 2: Success rates of all methods across six tasks on the Meta World benchmark on different datasets. (...) Figure 6: Online learning curves across five Meta World tasks, comparing PPL and PEBBLE.
Researcher Affiliation | Collaboration | *Equal contribution 1Seoul National University, Seoul, South Korea 2Korea University, Seoul, South Korea 3Hodoo AI Labs, Seoul, South Korea. Correspondence to: Kyungjae Lee <EMAIL>, Jungwoo Lee <EMAIL>.
Pseudocode | Yes | Due to page limitations, see Appendix C for the pseudocode. (...) Appendix C. Pseudocode. Algorithm 1 Policy-labeled Preference Learning (PPL)
Open Source Code | No | For more information, visit our project page: https://jjush.github.io/PPL/.
Open Datasets | Yes | For a fair comparison, we first evaluate the performance of PPL on six robotic manipulation tasks in Meta World (Yu et al., 2020), using the same rollout data provided by Hejna et al. (2023).
Dataset Splits | No | The paper describes how the different dataset types (homogeneous/heterogeneous, dense/sparse) were constructed and how segments were sampled and labeled. For example, it states, "For preference datasets, we conduct experiments under two settings: Dense, where comparisons are made between all segment pairs, and Sparse, where only one comparison is made per segment." and "To evaluate performance in heterogeneous datasets, we further construct an additional offline dataset by rolling out suboptimal policies with 20% and 50% success rates and combining them.", and "We uniformly sampled segments of length 64 and assigned labels based on estimated regret." However, it does not explicitly provide percentages or sample counts for training, validation, and test splits, which would be needed for reproducibility. The constructed preference datasets are used directly for training without a clear division into distinct evaluation splits.
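To make the quoted Dense/Sparse construction concrete, here is a minimal sketch of how the two pairing schemes could be realized, assuming each segment is summarized by a scalar estimated regret (lower regret preferred). The function names and the use of a scalar regret per segment are illustrative assumptions, not the paper's actual implementation.

```python
import random

def label_pair(regret_a, regret_b):
    """Illustrative regret-based label: 1 if segment A is preferred (lower regret), else 0."""
    return 1 if regret_a < regret_b else 0

def dense_pairs(regrets):
    """Dense setting: compare all segment pairs (quadratic in the number of segments)."""
    n = len(regrets)
    return [(i, j, label_pair(regrets[i], regrets[j]))
            for i in range(n) for j in range(i + 1, n)]

def sparse_pairs(regrets, rng=random):
    """Sparse setting: one comparison per segment, against a random other segment."""
    n = len(regrets)
    return [(i, (j := rng.choice([k for k in range(n) if k != i])),
             label_pair(regrets[i], regrets[j]))
            for i in range(n)]
```

For example, with estimated regrets `[0.1, 0.5, 0.3]`, the dense setting yields all three ordered comparisons, while the sparse setting yields exactly one comparison per segment; this illustrates why the two settings differ in preference-label density rather than in the labeling rule itself.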
Hardware Specification | No | We would like to thank LG AI Research (Youngsoo Jang, Geonhyeong Kim, Yujin Kim, and Moontae Lee) for their valuable feedback and for providing GPU resources that supported parts of this research. The paper mentions "GPU resources" but does not specify any particular GPU model (e.g., NVIDIA A100, RTX 3090) or CPU details, which would be required for a specific hardware specification.
Software Dependencies | No | The paper mentions using "suboptimal soft actor-critic (SAC)" and refers to the "official CPL implementation" for a baseline, but does not provide specific version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or programming languages used in its own implementation. For example, there is no mention of "Python 3.8" or "PyTorch 1.9."
Experiment Setup | Yes | E.1. Hyperparameter Setting. Table 3: Hyperparameter settings for offline implementation. Table 4: Hyperparameters for online implementation. Table 5: Hyperparameters for PPL, CPL, SFT, and P-IQL. These tables list specific values for various hyperparameters including 'Total Training Steps', 'Batch Size', 'Learning rates', 'Temperature α', 'Asymmetric regularizer λ', 'Actor Dropout', and 'Architecture'.