Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF
Authors: Han Shen, Zhuoran Yang, Tianyi Chen
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design. Lastly, we conduct experiments on example applications covered by our framework, including the Stackelberg game and RL from human feedback tasks. |
| Researcher Affiliation | Academia | Han Shen EMAIL Department of Electrical, Computer and System Engineering Rensselaer Polytechnic Institute Troy, NY 12180, USA Zhuoran Yang EMAIL Department of Statistics and Data Science Yale University New Haven, CT 06520, USA Tianyi Chen EMAIL Department of Electrical, Computer and System Engineering Rensselaer Polytechnic Institute Troy, NY 12180, USA |
| Pseudocode | Yes | Algorithm 1 PBRL: Penalty-based Bilevel RL Gradient-descent |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. The text only mentions the paper's license and attribution requirements, not the availability of code. |
| Open Datasets | Yes | We conduct our experiments in the Arcade Learning Environment (ALE) (Bellemare et al., 2013) through Open AI gym. |
| Dataset Splits | No | In Section 6.2, the paper describes data collection: "At the start of training, we collect 576 pairs of trajectories and warm up the reward predictor for 500 epochs. After training starts, we collect 16 new pairs per reward learning epoch. We only keep the last 3000 pairs in a buffer." However, it does not provide specific train/test/validation splits for the datasets used in the experiments. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or computer specifications used for running the experiments. It only refers to general aspects like "policy optimization" and "training starts". |
| Software Dependencies | No | The paper mentions "Open AI gym" and refers to "A2C" and "A3C (Mnih et al., 2016)" as policy gradient estimators, but it does not specify version numbers for these or any other software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow) used in the implementation. |
| Experiment Setup | Yes | For the independent policy gradient method... we set the learning rate as 0.1, and both the follower and the leader use Monte Carlo sampling with trajectory length 5 and batch size 16. For PBRL with value penalty, we set learning rate 0.1, penalty constant λ = 2, inner iteration number T = 1... For PBRL with the Bellman penalty, we use λ = 7 and inner iteration number T = 10 instead. For the Atari games, we use A2C... The policy and the critic share a common base model... The reward predictor has the same input... The reward predictor and the policy are trained synchronously. The reward predictor is updated for one epoch every 300 A2C updates. We compare trajectories of 25 time steps... For policy learning, we set the actor-critic learning rate 0.0003, the entropy coefficient 0.01, the actor-critic batch size 16, initial upper-level loss coefficient 0.001 which decays every 3000 actor-critic gradient steps... for reward learning, we set reward predictor learning rate 0.0003, reward predictor batch size 64... For Beamrider, we change the actor-critic learning rate to 7e-5. For the PBRL algorithms, we set the learning rate as 0.1 and a penalty constant λ = 4. The policy gradients are given by Monte Carlo sampling with trajectory length 5 and batch size 24. To obtain π̂₁^k and π̂₂^k at each outer iteration k, we run the policy gradient algorithm for a single iteration with learning rate 0.1. For the meta-gradient method, we use the same learning rate, trajectory length and batch size as PBRL. The inner iteration number is 1. |
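The hyperparameters quoted in the Experiment Setup row are scattered across three experiments; a reimplementer might consolidate them as below. This is a sketch only: all key names are illustrative choices made here, not identifiers from the paper, and grouping the λ = 4 PBRL run under `incentive_design` is an assumption about which experiment that paragraph describes.

```python
# Hypothetical config dict consolidating the reported hyperparameters.
# Values come from the quoted text; names and grouping are assumptions.
EXPERIMENT_CONFIG = {
    "stackelberg_game": {
        "independent_pg": {"lr": 0.1, "traj_len": 5, "batch_size": 16},
        "pbrl_value_penalty": {"lr": 0.1, "penalty_lambda": 2, "inner_iters": 1},
        "pbrl_bellman_penalty": {"lr": 0.1, "penalty_lambda": 7, "inner_iters": 10},
    },
    "atari_rlhf": {
        "actor_critic_lr": 3e-4,        # 7e-5 for Beamrider
        "entropy_coef": 0.01,
        "actor_critic_batch_size": 16,
        "init_upper_loss_coef": 0.001,  # decays every 3000 actor-critic steps
        "reward_lr": 3e-4,
        "reward_batch_size": 64,
        "reward_update_every": 300,     # one reward epoch per 300 A2C updates
        "traj_compare_len": 25,
        "warmup_pairs": 576,
        "warmup_epochs": 500,
        "new_pairs_per_epoch": 16,
        "buffer_size_pairs": 3000,      # only the last 3000 pairs are kept
    },
    "incentive_design": {
        "pbrl": {"lr": 0.1, "penalty_lambda": 4, "traj_len": 5, "batch_size": 24},
        "meta_gradient": {"lr": 0.1, "traj_len": 5, "batch_size": 24, "inner_iters": 1},
    },
}
```

Keeping the settings in one structure like this makes the missing pieces (hardware, software versions, dataset splits flagged "No" above) easier to spot when attempting a reproduction.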