CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries
Authors: Ni Mu, Hao Hu, Xiao Hu, Yiqin Yang, Bo Xu, Qing-Shan Jia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of CLARIFY in both non-ideal-teacher and real-human-feedback settings. First, CLARIFY outperforms state-of-the-art offline PbRL methods under non-ideal feedback from both scripted teachers and real human labelers. Second, human experiments show that the queries selected by CLARIFY are more clearly distinguished, thereby improving human labeling efficiency. Finally, visualization of the learned embedding space reveals that clearly distinguished segments are widely separated while similar segments are closely clustered together. |
| Researcher Affiliation | Collaboration | 1Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University, Beijing, China 2Moonshot AI, Beijing, China 3The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. |
| Pseudocode | Yes | The overall framework of CLARIFY is illustrated in Figure 1 and Algorithm 1, with implementation details provided in Section 4.3. Algorithm 1: The proposed offline PbRL method using CLARIFY embedding. |
| Open Source Code | Yes | The code repository of our method is: https://github.com/MoonOutCloudBack/CLARIFY_PbRL |
| Open Datasets | Yes | Dataset and tasks. Previous offline Pb RL studies often use D4RL (Fu et al., 2020) for evaluation, but D4RL is shown to be insensitive to reward learning due to the survival instinct (Li et al., 2023), where performance can remain high even with wrong rewards (Shin et al., 2023). To address this, we use the offline dataset presented by Choi et al. (2024) with Metaworld (Yu et al., 2020) and DMControl (Tassa et al., 2018), which has been proven to be suitable for reward learning (Choi et al., 2024). |
| Dataset Splits | No | In offline PbRL, the true reward function r is unknown, and we have an offline dataset D without reward signals. We request preference feedback p for two trajectory segments σ0 and σ1 of length H sampled from D. ... Policy learning: label the offline dataset D using the reward model r̂θ, then train the IQL policy πθ on the relabeled dataset D. The paper describes using an offline dataset (D) and labeling it for policy training, but it does not specify explicit training/validation/test splits for this dataset. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details (like GPU or CPU models, memory, etc.) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, ReLU activations, a final Tanh activation, and the use of BERT and GPT architectures within a Bi-directional Decision Transformer. However, it does not provide version numbers for any libraries (e.g., PyTorch 1.9, TensorFlow 2.x) or other software packages, which are necessary for a reproducible dependency description. |
| Experiment Setup | Yes | The hyperparameters for offline policy learning are provided in Table 9. The hyperparameters for both the baselines and our method are listed in Table 10. Table 9 details hyperparameters for Reward model (e.g., Optimizer Adam, Learning rate 3e-4, Batch size 128) and Policy learning (e.g., Critic, Actor, Value hidden dim 256, Learning rate 0.5). Table 10 provides specific hyperparameters for OPRL, PT, OPPO, and CLARIFY, including embedding dimensions, learning rates, batch sizes, dropout, and loss coefficients (e.g., λamb 0.1, λquad 1, λnorm 0.1). |
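The preference-feedback setup quoted in the table (feedback p over two trajectory segments σ0 and σ1, used to train a reward model r̂θ) follows the standard Bradley-Terry formulation common in PbRL. A minimal pure-Python sketch of that preference loss, with all function names hypothetical and no claim that CLARIFY's implementation matches this exactly:

```python
import math

def segment_return(rewards):
    """Sum of predicted per-step rewards over one trajectory segment."""
    return sum(rewards)

def preference_prob(rewards_0, rewards_1):
    """Bradley-Terry probability that segment 0 is preferred over segment 1."""
    logit = segment_return(rewards_0) - segment_return(rewards_1)
    return 1.0 / (1.0 + math.exp(-logit))

def preference_loss(rewards_0, rewards_1, pref):
    """Cross-entropy between the model's preference probability and the label.

    pref = 1.0 labels segment 0 as preferred; pref = 0.0 labels segment 1.
    """
    p0 = preference_prob(rewards_0, rewards_1)
    eps = 1e-8  # guard against log(0)
    return -(pref * math.log(p0 + eps) + (1.0 - pref) * math.log(1.0 - p0 + eps))
```

The loss is near zero when the reward model's segment returns agree with the human label and grows when they disagree; in the full pipeline this objective would be minimized over the labeled query set (e.g., with Adam, as quoted in Table 9) before relabeling D for IQL policy learning.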