Robust Reward Alignment via Hypothesis Space Batch Cutting
Authors: Zhixian Xie, Haode Zhang, Yizhe Feng, Wanxin Jin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in a model predictive control setting across diverse tasks. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences. |
| Researcher Affiliation | Academia | 1Arizona State University, Tempe AZ 85283, United States. 2Shanghai Jiao Tong University, Shanghai, China. |
| Pseudocode | Yes | Algorithm 1 Implementation for HSBC Algorithm |
| Open Source Code | Yes | The paper states that its project website and code can be accessed via a link. |
| Open Datasets | No | The paper uses dm-control tasks (Cartpole-Swingup, Walker Walk, Humanoid-Standup) which are well-known simulation environments, but does not explicitly state that a pre-existing or newly generated *dataset* of trajectories or human preferences is publicly available or provide access information for such a dataset. The B-Pref reference is to a paper defining teacher models, not directly to a dataset resource. |
| Dataset Splits | No | The paper describes collecting human feedback (50 simulated, then 50 human preferences for Cart Pole; 100 simulated, then 100 human preferences for Walker) and uses learning checkpoints for evaluation. However, it does not explicitly provide information on how any dataset was split into training, validation, or test sets in the traditional sense. |
| Hardware Specification | No | The paper mentions 'GPU-accelerated sampling-based MPC' but does not specify any particular GPU model, CPU, memory, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'MJX simulator', 'MPPI', and 'Adam optimizer', but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | HSBC Algorithm settings: In all tasks, rewards are MLP models. Unless specially mentioned, batch size N = 10, ensemble size M = 16, and disagreement threshold η = 0.75. In dm-control tasks and Go2-Standup, trajectory pairs of the first few batches are generated by a random policy for better exploration. Refer to Appendix C for other settings. (Table 5 and Table 6 provide detailed parameters for learning and MPPI respectively). Appendix C states: 'For all tasks, we use an MLP with 3 hidden layers of 256 hidden units to parameterize reward functions. The activation function of the network is ReLU. ... The Adam (Kingma, 2014) optimizer parameter is identical for all dm-control and Go2-Standup tasks, the learning rate is set to be 0.005 with a weight decay coefficient 0.001. In dexterous manipulation tasks, the learning rate is 0.002 with a weight decay coefficient 0.001.' |
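To make the reported ensemble settings concrete, the sketch below shows one plausible reading of how a disagreement threshold η = 0.75 over an ensemble of M = 16 reward models could gate a preference pair. The hyperparameter values come from the paper; the `ensemble_agrees` voting logic itself is a hypothetical illustration, not the authors' released implementation.

```python
import numpy as np

# Values reported in the paper (HSBC settings / Appendix C).
BATCH_SIZE = 10      # N: preference pairs per batch
ENSEMBLE_SIZE = 16   # M: reward models in the ensemble
DISAGREE_ETA = 0.75  # η: disagreement threshold

def ensemble_agrees(returns_a, returns_b, eta=DISAGREE_ETA):
    """Hypothetical gate: True if at least an eta fraction of ensemble
    members rank trajectory A versus trajectory B the same way.

    returns_a, returns_b: length-M arrays of per-member predicted returns
    for the two trajectories in a preference pair.
    """
    votes_for_a = np.asarray(returns_a) > np.asarray(returns_b)
    # Majority fraction, regardless of which trajectory it favors.
    frac = max(votes_for_a.mean(), 1.0 - votes_for_a.mean())
    return frac >= eta

# 13 of 16 members prefer A: 13/16 ≈ 0.81 ≥ 0.75, so the pair passes.
ra = np.r_[np.ones(13), np.zeros(3)]
rb = np.zeros(16)
print(ensemble_agrees(ra, rb))  # True
```

Under this reading, pairs where the ensemble's vote fraction falls below η would be treated as uncertain rather than used to cut the hypothesis space; the paper's Algorithm 1 and Appendix C should be consulted for the exact criterion.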