Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries

Authors: Xuening Feng, Zhaohui Jiang, Timo Kaufmann, Eyke Hüllermeier, Paul Weng, Yifei Zhu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness in RLHF benchmarks, particularly in classical control settings where preference strength is critical for expected utility maximization.
Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University, Shanghai, China; 2 Institute for Informatics, LMU Munich, Munich, Germany; 3 Munich Center of Machine Learning (MCML), Munich, Germany; 4 German Research Center for Artificial Intelligence (DFKI); 5 Digital Innovation Research Center, Duke Kunshan University, Kunshan, China.
Pseudocode | Yes | The pseudo-code for the general procedure can be found in Algorithm 1 in Appendix C. Algorithm 1 lists the pseudo-code of our algorithm, differentiating between PEBBLE and our changes with the use of color (changes in orange).
Open Source Code | No | The paper does not provide access to source code. It only mentions: "Videos of selected queries and evaluation of trained agents for both methods are available at https://zenodo.org/records/15606992." This link is for videos, not source code.
Open Datasets | Yes | Similar to prior works (Lee et al., 2021b;a; Park et al., 2022; Liang et al., 2022; Liu et al., 2022; Hu et al., 2024), we consider a series of locomotion tasks from the DeepMind Control Suite (DMControl) (Tassa et al., 2018) and robotic manipulation tasks from the Meta-World benchmark (Yu et al., 2019).
Dataset Splits | No | The paper discusses collecting trajectories and sampling segments to generate queries for reward learning, but it does not specify explicit train/test/validation splits for static datasets used in experiments. The process is described as online and interactive, where data is continuously generated and feedback is collected.
Hardware Specification | Yes | For all experiments, we only need one GPU card to launch experiments. Experiments were carried out on different platforms: GeForce RTX 3060 12GB GPU + 48GB memory + Intel Core i7-10700F; GeForce RTX 3060 12GB GPU + 64GB memory + Intel Core i7-12700; GeForce RTX 2070 SUPER + 32GB memory + Intel Core i7-9700; or GeForce RTX 2060 + 64GB memory + Intel Core i7-8700.
Software Dependencies | No | The paper mentions using the "Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018)" and that "Dist Q is implemented based on the PEBBLE framework." However, it does not provide specific version numbers for these or any other key software libraries, programming languages, or environments needed for replication.
Experiment Setup | Yes | Table 2. Hyperparameter settings. General settings: initial temperature 0.1; hidden units per layer 1024 (DMControl) / 256 (Meta-World); # of layers 2 (DMControl) / 3 (Meta-World); learning rate 0.0003 (Meta-World), 0.0005 (Walker), 0.0001 (Quadruped, Humanoid); batch size 1024 (DMControl) / 512 (Meta-World); optimizer Adam; critic target update freq 2; critic EMA τ 0.005; discount γ 0.99; frequency of feedback 5000 (Meta-World, Humanoid), 20000 (Walker), 30000 (Quadruped); maximum budget / # of queries per session 10000/50, 3000/30, 2000/100, 400/10 (Meta-World); # of ensemble models N_en 3; # of pre-training steps 10000. Other settings for Dist Q: loss weights (λ_d, λ_p) = (1, 1); size of Q_p = 10 × # of queries per session; size of Q_pv (n_I) = 5 × # of queries per session; size of Q_pve (2n_E) = 2 × # of queries per session.
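For reference, the per-domain settings reported in Table 2 can be collected into a single configuration sketch. This is an illustrative layout only, assuming nothing about the authors' actual code: the dictionary keys, the `config_for` helper, and the `distq_query_set_sizes` helper are our own names, and the values are transcribed from the table above.

```python
# Hedged sketch of the Table 2 hyperparameters, grouped into shared settings
# plus per-domain overrides. Key names are hypothetical, not the paper's code.
COMMON = {
    "initial_temperature": 0.1,
    "optimizer": "Adam",
    "critic_target_update_freq": 2,
    "critic_ema_tau": 0.005,
    "discount_gamma": 0.99,
    "num_ensemble_models": 3,      # N_en
    "num_pretraining_steps": 10_000,
}

DOMAIN = {
    "dmcontrol": {"hidden_units": 1024, "num_layers": 2, "batch_size": 1024},
    "metaworld": {"hidden_units": 256, "num_layers": 3, "batch_size": 512,
                  "learning_rate": 3e-4, "feedback_frequency": 5000},
}

def config_for(domain: str) -> dict:
    """Merge shared settings with domain-specific overrides."""
    return {**COMMON, **DOMAIN[domain]}

def distq_query_set_sizes(queries_per_session: int) -> dict:
    """Dist Q query-set sizes, each a fixed multiple of queries per session."""
    return {
        "Q_p": 10 * queries_per_session,
        "Q_pv": 5 * queries_per_session,   # n_I
        "Q_pve": 2 * queries_per_session,  # 2 n_E
    }
```

For example, `distq_query_set_sizes(50)` yields `Q_p = 500` under the 10000/50 budget setting; learning rates for the individual DMControl tasks (Walker, Quadruped, Humanoid) would still need to be set per task as in the table.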