Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries

Authors: Xuening Feng, Zhaohui Jiang, Timo Kaufmann, Eyke Hüllermeier, Paul Weng, Yifei Zhu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness in RLHF benchmarks, particularly in classical control settings where preference strength is critical for expected utility maximization.
Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University, Shanghai, China; 2 Institute for Informatics, LMU Munich, Munich, Germany; 3 Munich Center of Machine Learning (MCML), Munich, Germany; 4 German Research Center for Artificial Intelligence (DFKI); 5 Digital Innovation Research Center, Duke Kunshan University, Kunshan, China.
Pseudocode | Yes | The pseudo-code for the general procedure can be found in Algorithm 1 in Appendix C. Algorithm 1 lists the pseudo-code of our algorithm, differentiating between PEBBLE and our changes with the use of color (changes in orange).
Open Source Code | No | The paper does not provide access to source code. It only mentions: "Videos of selected queries and evaluation of trained agents for both methods are available at https://zenodo.org/records/15606992." This link is for videos, not source code.
Open Datasets | Yes | Similar to prior works (Lee et al., 2021b;a; Park et al., 2022; Liang et al., 2022; Liu et al., 2022; Hu et al., 2024), we consider a series of locomotion tasks from the DeepMind Control Suite (DMControl) (Tassa et al., 2018) and robotic manipulation tasks from the Meta-World benchmark (Yu et al., 2019).
Dataset Splits | No | The paper discusses collecting trajectories and sampling segments to generate queries for reward learning, but it does not specify explicit train/test/validation splits for static datasets used in experiments. The process is described as online and interactive, where data is continuously generated and feedback is collected.
Hardware Specification | Yes | For all experiments, we only need one GPU card to launch experiments. Experiments were carried out on different platforms: GeForce RTX 3060 12GB GPU + 48GB memory + Intel Core i7-10700F; GeForce RTX 3060 12GB GPU + 64GB memory + Intel Core i7-12700; GeForce RTX 2070 SUPER + 32GB memory + Intel Core i7-9700; or GeForce RTX 2060 + 64GB memory + Intel Core i7-8700.
Software Dependencies | No | The paper mentions using the "Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018)" and that "Dist Q is implemented based on the PEBBLE framework." However, it does not provide specific version numbers for these or any other key software libraries, programming languages, or environments needed for replication.
Experiment Setup | Yes | Table 2. Hyperparameter settings. General settings: initial temperature 0.1; hidden units per layer 1024 (DMControl) / 256 (Meta-World); # of layers 2 (DMControl) / 3 (Meta-World); learning rate 0.0003 (Meta-World), 0.0005 (Walker), 0.0001 (Quadruped, Humanoid); batch size 1024 (DMControl) / 512 (Meta-World); optimizer Adam; critic target update freq 2; critic EMA τ 0.005; discount γ 0.99; frequency of feedback 5000 (Meta-World, Humanoid), 20000 (Walker), 30000 (Quadruped); maximum budget / # of queries per session 10000/50, 3000/30, 2000/100, 400/10 (Meta-World); # of ensemble models N_en 3; # of pre-training steps 10000. Other settings for Dist Q: loss weights (λ_d, λ_p) = (1, 1); size of Q_p = 10 × # of queries per session; size of Q_pv (n_I) = 5 × # of queries per session; size of Q_pve (2n_E) = 2 × # of queries per session.
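For reference, the per-domain settings reported in Table 2 can be collected into a single configuration sketch. This is an illustrative layout only, assuming nothing about the authors' actual code: the dictionary keys, the `config_for` helper, and the `distq_query_set_sizes` helper are our own names, and the values are transcribed from the table above.

```python
# Hedged sketch of the Table 2 hyperparameters, grouped into shared settings
# plus per-domain overrides. Key names are hypothetical, not the paper's code.
COMMON = {
    "initial_temperature": 0.1,
    "optimizer": "Adam",
    "critic_target_update_freq": 2,
    "critic_ema_tau": 0.005,
    "discount_gamma": 0.99,
    "num_ensemble_models": 3,      # N_en
    "num_pretraining_steps": 10_000,
}

DOMAIN = {
    "dmcontrol": {"hidden_units": 1024, "num_layers": 2, "batch_size": 1024},
    "metaworld": {"hidden_units": 256, "num_layers": 3, "batch_size": 512,
                  "learning_rate": 3e-4, "feedback_frequency": 5000},
}

def config_for(domain: str) -> dict:
    """Merge shared settings with domain-specific overrides."""
    return {**COMMON, **DOMAIN[domain]}

def distq_query_set_sizes(queries_per_session: int) -> dict:
    """Dist Q query-set sizes, each a fixed multiple of queries per session."""
    return {
        "Q_p": 10 * queries_per_session,
        "Q_pv": 5 * queries_per_session,   # n_I
        "Q_pve": 2 * queries_per_session,  # 2 n_E
    }
```

For example, `distq_query_set_sizes(50)` yields `Q_p = 500` under the 10000/50 budget setting; learning rates for the individual DMControl tasks (Walker, Quadruped, Humanoid) would still need to be set per task as in the table.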