Robust Reward Alignment via Hypothesis Space Batch Cutting
Authors: Zhixian Xie, Haode Zhang, Yizhe Feng, Wanxin Jin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in a model predictive control setting across diverse tasks. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences. |
| Researcher Affiliation | Academia | 1Arizona State University, Tempe AZ 85283, United States. 2Shanghai Jiao Tong University, Shanghai, China. |
| Pseudocode | Yes | Algorithm 1 Implementation for HSBC Algorithm |
| Open Source Code | Yes | The paper states that its project website and code can be accessed via a link. |
| Open Datasets | No | The paper uses dm-control tasks (Cartpole-Swingup, Walker Walk, Humanoid-Standup) which are well-known simulation environments, but does not explicitly state that a pre-existing or newly generated *dataset* of trajectories or human preferences is publicly available or provide access information for such a dataset. The B-Pref reference is to a paper defining teacher models, not directly to a dataset resource. |
| Dataset Splits | No | The paper describes collecting human feedback (50 simulated, then 50 human preferences for Cart Pole; 100 simulated, then 100 human preferences for Walker) and uses learning checkpoints for evaluation. However, it does not explicitly provide information on how any dataset was split into training, validation, or test sets in the traditional sense. |
| Hardware Specification | No | The paper mentions 'GPU-accelerated sampling-based MPC' but does not specify any particular GPU model, CPU, memory, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'MJX simulator', 'MPPI', and 'Adam optimizer', but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | HSBC Algorithm settings: In all tasks, rewards are MLP models. Unless specially mentioned, batch size N = 10, ensemble size M = 16, and disagreement threshold η = 0.75. In dm-control tasks and Go2-Standup, trajectory pairs of the first few batches are generated by a random policy for better exploration. Refer to Appendix C for other settings. (Table 5 and Table 6 provide detailed parameters for learning and MPPI respectively). Appendix C states: 'For all tasks, we use an MLP with 3 hidden layers of 256 hidden units to parameterize reward functions. The activation function of the network is ReLU. ... The Adam (Kingma, 2014) optimizer parameter is identical for all dm-control and Go2-Standup tasks, the learning rate is set to be 0.005 with a weight decay coefficient 0.001. In dexterous manipulation tasks, the learning rate is 0.002 with a weight decay coefficient 0.001.' |
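To make the reported ensemble settings concrete, the sketch below shows one plausible reading of how a disagreement threshold η = 0.75 over an ensemble of M = 16 reward models could gate a preference pair. The hyperparameter values come from the paper; the `ensemble_agrees` voting logic itself is a hypothetical illustration, not the authors' released implementation.

```python
import numpy as np

# Values reported in the paper (HSBC settings / Appendix C).
BATCH_SIZE = 10      # N: preference pairs per batch
ENSEMBLE_SIZE = 16   # M: reward models in the ensemble
DISAGREE_ETA = 0.75  # η: disagreement threshold

def ensemble_agrees(returns_a, returns_b, eta=DISAGREE_ETA):
    """Hypothetical gate: True if at least an eta fraction of ensemble
    members rank trajectory A versus trajectory B the same way.

    returns_a, returns_b: length-M arrays of per-member predicted returns
    for the two trajectories in a preference pair.
    """
    votes_for_a = np.asarray(returns_a) > np.asarray(returns_b)
    # Majority fraction, regardless of which trajectory it favors.
    frac = max(votes_for_a.mean(), 1.0 - votes_for_a.mean())
    return frac >= eta

# 13 of 16 members prefer A: 13/16 ≈ 0.81 ≥ 0.75, so the pair passes.
ra = np.r_[np.ones(13), np.zeros(3)]
rb = np.zeros(16)
print(ensemble_agrees(ra, rb))  # True
```

Under this reading, pairs where the ensemble's vote fraction falls below η would be treated as uncertain rather than used to cut the hypothesis space; the paper's Algorithm 1 and Appendix C should be consulted for the exact criterion.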