Deep Reinforcement Learning from Hierarchical Preference Design

Authors: Alexander Bukharin, Yixiao Li, Pengcheng He, Tuo Zhao

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | We apply HERON to several RL applications and find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness. We empirically validate the HERON framework through extensive experiments on real-world applications: traffic light control, code generation, language model alignment, and robotic control.
Researcher Affiliation | Collaboration | 1 NVIDIA, 2 Georgia Institute of Technology, 3 Zoom Communications. Correspondence to: Alexander Bukharin <EMAIL>.
Pseudocode | No | The paper describes the methodology in Section 3 and illustrates the preference elicitation process with Figure 1, but it does not contain a clearly labeled pseudocode or algorithm block; the steps are described in regular paragraph text.
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code or a link to a code repository for the described methodology. It mentions third-party tools such as QCOMBO, the Flow framework, CodeT5 models, and LoRA, but not its own implementation code.
Open Datasets | Yes | We primarily evaluate HERON on APPS, a Python programming dataset containing 5000 test problems (Hendrycks et al., 2021). Each question in the dataset comes with expert demonstrations and test cases the program should pass. To evaluate the generalization ability of the policies, we evaluate each policy in a zero-shot manner on the MBPP dataset, which contains 974 basic Python programming questions (Austin et al., 2021). For this experiment we employ Phi-2 (Javaheripi et al.) as our base model and train it on the HelpSteer dataset (Wang et al., 2023).
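All three corpora named in this row are publicly available; a minimal loading sketch using the Hugging Face `datasets` library follows. The Hub identifiers (`codeparrot/apps`, `mbpp`, `nvidia/HelpSteer`) and split names are our assumptions about where these datasets are hosted, not paths given in the paper.

```python
# Sizes as reported in the paper's Open Datasets row.
DATASET_INFO = {
    "APPS": {"size": 5000, "role": "primary evaluation (test problems)"},
    "MBPP": {"size": 974, "role": "zero-shot generalization"},
    "HelpSteer": {"role": "alignment training data"},
}


def load_eval_datasets():
    """Load the three corpora from the Hugging Face Hub.

    The Hub identifiers and splits below are assumed locations, not
    ones specified in the paper; requires `pip install datasets` and
    network access, so the import is deferred into the function.
    """
    from datasets import load_dataset  # third-party dependency

    apps = load_dataset("codeparrot/apps", split="test")
    mbpp = load_dataset("mbpp", split="test")
    helpsteer = load_dataset("nvidia/HelpSteer", split="train")
    return apps, mbpp, helpsteer
```

The deferred import keeps the metadata usable offline while still documenting one plausible way to fetch the actual data.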
Dataset Splits | No | The paper mentions training on the HelpSteer dataset and evaluating on the HelpSteer test dataset, implying a split, and refers to APPS and MBPP as 'test problems' or evaluation datasets. However, it does not provide specific ratios, counts, or an explicit methodology for partitioning these datasets into training, validation, and test sets, which would be needed to reproduce the data partitioning exactly.
Hardware Specification | Yes | For the classic control tasks and the traffic light control experiment, we run experiments on Intel Xeon 6154 CPUs. For the code generation task, we train with Tesla V100 32GB GPUs.
Software Dependencies | No | The paper mentions several software components and frameworks, such as OpenAI Gym, QCOMBO, the Flow framework, CodeT5 models, the PyBullet simulator, and LoRA. However, it does not provide version numbers for any of these dependencies, which are necessary for reproducible replication.
Experiment Setup | Yes | For the classic control experiments we use the DDPG algorithm, with policies parameterized by three-layer MLPs with 256 hidden units per layer. We use the Adam optimizer and search for a learning rate in [1e-5, 1e-3]. For Mountain Car we train for a total of 15000 timesteps and begin training after 5000 timesteps. For Pendulum, we train for a total of 50000 timesteps and begin learning after 25000 timesteps. To train the initial behavior model we use behavior cloning (supervised fine-tuning)...train with the cross-entropy loss for 12000 iterations, using a batch size of 64. We use the Adam optimizer with a learning rate of 2e-5. For the SFT base model, we train for two epochs with learning rate 5e-5. We use batch size 32 and train for 2 epochs. For REINFORCE we also use learning rate 5e-5, batch size 32, and train for 2 epochs. For DPO, we use learning rate 5e-5, batch size 32, β = 0.1, and train for 2 epochs.
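The hyperparameters quoted in this row can be gathered into a single configuration sketch for at-a-glance comparison. The dictionary layout and key names below are our own illustrative choices, not structures from the authors' code; only the numeric values are taken from the paper.

```python
# Hyperparameters as reported in the Experiment Setup row.
# Layout and key names are illustrative, not the authors' own.
EXPERIMENT_CONFIG = {
    "classic_control": {
        "algorithm": "DDPG",
        "policy": {"type": "MLP", "layers": 3, "hidden_units": 256},
        "optimizer": "Adam",
        "lr_search_range": (1e-5, 1e-3),
        "mountain_car": {"total_timesteps": 15_000, "learning_starts": 5_000},
        "pendulum": {"total_timesteps": 50_000, "learning_starts": 25_000},
    },
    "behavior_cloning": {
        "loss": "cross_entropy",
        "iterations": 12_000,
        "batch_size": 64,
        "optimizer": "Adam",
        "lr": 2e-5,
    },
    "language_model_alignment": {
        "sft": {"lr": 5e-5, "batch_size": 32, "epochs": 2},
        "reinforce": {"lr": 5e-5, "batch_size": 32, "epochs": 2},
        "dpo": {"lr": 5e-5, "batch_size": 32, "epochs": 2, "beta": 0.1},
    },
}


def lr_in_search_range(lr, cfg=EXPERIMENT_CONFIG):
    """Check a candidate learning rate against the reported search interval."""
    lo, hi = cfg["classic_control"]["lr_search_range"]
    return lo <= lr <= hi
```

Collecting the values this way also makes the report's "Yes" verdict easy to audit: every alignment method shares lr 5e-5, batch size 32, and 2 epochs, with DPO adding only β = 0.1.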