CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries
Authors: Ni Mu, Hao Hu, Xiao Hu, Yiqin Yang, Bo Xu, Qing-Shan Jia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of CLARIFY in both non-ideal-teacher and real-human-feedback settings. First, CLARIFY outperforms state-of-the-art offline PbRL methods under non-ideal feedback from both scripted teachers and real human labelers. Second, human experiments show that the queries selected by CLARIFY are more clearly distinguished, thereby improving human labeling efficiency. Finally, visualization of the learned embedding space reveals that clearly distinguished segments are widely separated while similar segments are closely clustered together. |
| Researcher Affiliation | Collaboration | 1Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University, Beijing, China 2Moonshot AI, Beijing, China 3The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. |
| Pseudocode | Yes | The overall framework of CLARIFY is illustrated in Figure 1 and Algorithm 1, with implementation details provided in Section 4.3. Algorithm 1: The proposed offline PbRL method using CLARIFY embedding. |
| Open Source Code | Yes | The code repository of our method is: https://github.com/MoonOutCloudBack/CLARIFY_PbRL |
| Open Datasets | Yes | Dataset and tasks. Previous offline Pb RL studies often use D4RL (Fu et al., 2020) for evaluation, but D4RL is shown to be insensitive to reward learning due to the survival instinct (Li et al., 2023), where performance can remain high even with wrong rewards (Shin et al., 2023). To address this, we use the offline dataset presented by Choi et al. (2024) with Metaworld (Yu et al., 2020) and DMControl (Tassa et al., 2018), which has been proven to be suitable for reward learning (Choi et al., 2024). |
| Dataset Splits | No | In offline PbRL, the true reward function r is unknown, and we have an offline dataset D without reward signals. We request preference feedback p for two trajectory segments σ0 and σ1 of length H sampled from D. ... Policy learning: label the offline dataset D using the reward model r̂θ, then train the IQL policy πθ on the relabeled dataset D. The paper describes using an offline dataset (D) and labeling it for policy training, but it does not specify explicit training/validation/test splits for this dataset. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details (like GPU or CPU models, memory, etc.) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, ReLU activations, a final Tanh activation, and the use of BERT and GPT architectures within a Bi-directional Decision Transformer. However, it does not provide version numbers for any libraries (e.g., PyTorch 1.9, TensorFlow 2.x) or other software packages, which are necessary for a reproducible dependency description. |
| Experiment Setup | Yes | The hyperparameters for offline policy learning are provided in Table 9. The hyperparameters for both the baselines and our method are listed in Table 10. Table 9 details hyperparameters for Reward model (e.g., Optimizer Adam, Learning rate 3e-4, Batch size 128) and Policy learning (e.g., Critic, Actor, Value hidden dim 256, Learning rate 0.5). Table 10 provides specific hyperparameters for OPRL, PT, OPPO, and CLARIFY, including embedding dimensions, learning rates, batch sizes, dropout, and loss coefficients (e.g., λamb 0.1, λquad 1, λnorm 0.1). |
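The preference-feedback setup quoted in the table (feedback p over two trajectory segments σ0 and σ1, used to train a reward model r̂θ) follows the standard Bradley-Terry formulation common in PbRL. A minimal pure-Python sketch of that preference loss, with all function names hypothetical and no claim that CLARIFY's implementation matches this exactly:

```python
import math

def segment_return(rewards):
    """Sum of predicted per-step rewards over one trajectory segment."""
    return sum(rewards)

def preference_prob(rewards_0, rewards_1):
    """Bradley-Terry probability that segment 0 is preferred over segment 1."""
    logit = segment_return(rewards_0) - segment_return(rewards_1)
    return 1.0 / (1.0 + math.exp(-logit))

def preference_loss(rewards_0, rewards_1, pref):
    """Cross-entropy between the model's preference probability and the label.

    pref = 1.0 labels segment 0 as preferred; pref = 0.0 labels segment 1.
    """
    p0 = preference_prob(rewards_0, rewards_1)
    eps = 1e-8  # guard against log(0)
    return -(pref * math.log(p0 + eps) + (1.0 - pref) * math.log(1.0 - p0 + eps))
```

The loss is near zero when the reward model's segment returns agree with the human label and grows when they disagree; in the full pipeline this objective would be minimized over the labeled query set (e.g., with Adam, as quoted in Table 9) before relabeling D for IQL policy learning.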