Provably Efficient Exploration in Inverse Constrained Reinforcement Learning

Authors: Bo Yue, Jian Li, Guiliang Liu

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "To empirically study how well our method captures the accurate constraint, we conduct evaluations under different environments. The experimental results show that PCSE significantly outperforms other exploration strategies and applies to continuous environments."
Researcher Affiliation | Academia | "School of Data Science, The Chinese University of Hong Kong, Shenzhen; Stony Brook University, New York. Correspondence to: Guiliang Liu <EMAIL>."
Pseudocode | Yes | "Algorithm 1: BEAR and PCSE for ICRL in an unknown environment"
Open Source Code | No | The paper states: "Our implementation of code for discrete environments is adapted from (Liu et al., 2023), and for continuous environments, it is adapted from (Lazcano et al., 2024)." While it references external code, it does not provide a specific link or explicit statement that *their* implementation of the described methodology is open-source or publicly available.
Open Datasets | Yes | "Our implementation of code for discrete environments is adapted from (Liu et al., 2023), and for continuous environments, it is adapted from (Lazcano et al., 2024)." Lazcano, R., Andreas, K., Tai, J. J., Lee, S. R., and Terry, J. Gymnasium Robotics, 2024. URL http://github.com/Farama-Foundation/Gymnasium-Robotics. "Point Maze. In this environment, we create a map of 5 m × 5 m, where the area of each cell is 1 m × 1 m."
Dataset Splits | No | The paper describes generating data through interaction with custom Gridworld and Point Maze environments. While it specifies details about the environments and the number of episodes, it does not mention traditional dataset splits (e.g., train/test/validation percentages or counts) for a pre-collected dataset, as data is collected online during the reinforcement learning process.
Hardware Specification | Yes | "We ran experiments on a desktop computer with Intel(R) Core(TM) i5-14400F and NVIDIA GeForce RTX 2080 Ti."
Software Dependencies | No | The paper mentions adapting code from other works (Liu et al., 2023; Lazcano et al., 2024) and using algorithms like Deep Q Network (DQN) and Proximal Policy Optimization (PPO). However, it does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks used in their implementation.
Experiment Setup | Yes | "In this paper, we create a map with dimensions of 7 × 7 units and define four distinct settings... The agent starts in the lower left cell (0, 0), and it has 8 actions... The reward in the reward state cell is 1, while all other cells have a 0 reward. The cost in a constraint location is also 1. The game continues until a maximum time step of 50 is reached. We plot the mean and 68% confidence interval (1-sigma error bar) computed with 5 random seeds (123456, 123, 1234, 36, 34) and exploration episodes n_e = 1. The ϵ-greedy strategy selects an action based on the ϵ-greedy algorithm, balancing exploration and exploitation with the exploration parameter ϵ = 1/√t. We first train a Deep Q Network (DQN) in advance... For algorithm BEAR, Proximal Policy Optimization (PPO) is utilized to obtain the exploration policy π_k. In this environment, we create a map of 5 m × 5 m... The constraint is initially set at the cell centered at (−1, 0)... The game terminates when a maximum time step of 500 is reached."
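The decaying exploration schedule ϵ = 1/√t quoted above is the one concrete algorithmic detail a reproducer can implement directly. The sketch below is an illustrative reimplementation, not the authors' code; the function name `epsilon_greedy_action` and the uniform-random fallback over actions are assumptions for the sake of a self-contained example:

```python
import numpy as np

def epsilon_greedy_action(q_values, t, rng):
    """Select an action epsilon-greedily with the decaying schedule
    epsilon = 1/sqrt(t) described in the paper's experiment setup.

    q_values: 1-D array of Q-value estimates for the current state.
    t: current time step (t >= 1, so epsilon starts at 1 and decays).
    rng: a numpy Generator, so runs are reproducible under a fixed seed.
    """
    epsilon = 1.0 / np.sqrt(t)
    if rng.random() < epsilon:
        # Explore: pick a uniformly random action.
        return int(rng.integers(len(q_values)))
    # Exploit: pick the greedy (highest-value) action.
    return int(np.argmax(q_values))
```

At t = 1 the schedule gives ϵ = 1, so early behavior is pure exploration; as t grows the policy becomes increasingly greedy, which matches the exploration/exploitation balance the row describes.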