Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning
Authors: Chenglin Li, Guangchun Ruan, Hua Geng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the Experiments section, simulation results demonstrate that the proposed model fully guarantees safety (quantile constraints) while outperforming the state-of-the-art benchmarks with higher returns. We evaluate the proposed TQPO on three classic safe RL tasks: Simple Env, Dynamic Env and Gremlin Env from Mujoco and Safety Gym. |
| Researcher Affiliation | Academia | ¹Department of Automation, Tsinghua University, Beijing, China; ²Laboratory for Information & Decision Systems, MIT, Boston, USA. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Overall, the training process of TQPO iterates as follows: generate a batch of samples {s_0, s_1, ..., s_N} ~ π_θ; update the value network parameter ϕ; then update the three main parameters q, θ and λ as follows: q_{k+1} = q_k + α_k(q̂_{1−ε} − q_k) (17a); θ_{k+1} = θ_k + β_k ∇_θ L_θ (17b); λ_{k+1} = λ_k + η_k(q_k − d) (17c), where α_k, β_k, and η_k are the update rates of the three parameters respectively. |
| Open Source Code | Yes | Code is available at https://github.com/CharlieLeeeee/TQPO |
| Open Datasets | Yes | We evaluate the proposed TQPO on three classic safe RL tasks: Simple Env, Dynamic Env and Gremlin Env from Mujoco and Safety Gym (Todorov, Erez, and Tassa 2012; Ray, Achiam, and Amodei 2019). |
| Dataset Splits | No | The paper describes generating batches of samples for training (e.g., 'Generate a batch of samples {s0, s1, . . . , s N} πθ') and mentions 'an episode of 1000 steps' and running experiments with 'five random seeds'. However, it does not specify explicit training/test/validation dataset splits with percentages or counts, as typically understood for pre-existing datasets in supervised learning. The data for RL is generated dynamically through interaction with the environment rather than being split from a fixed dataset. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware components (e.g., GPU models, CPU types, memory specifications) used to run the experiments. It mentions the simulation environments (Mujoco and Safety Gym) but not the underlying computational hardware. |
| Software Dependencies | No | The paper mentions that 'The algorithm is based on the classic RL algorithm Proximal Policy Optimization (PPO) (Schulman et al. 2017)' and refers to 'rlpyt: A research code base for deep reinforcement learning in pytorch. arXiv preprint arXiv:1909.01500.' in the references. However, it does not specify version numbers for any software, libraries, or frameworks like PPO, PyTorch, or rlpyt. |
| Experiment Setup | Yes | The cost threshold is set to d = 15. Since many safety-critical applications require a high safety probability above 90%, 1 − ε = 90%, 95% are used in the experiments. All the experiments are conducted with five random seeds. Implementation details of TQPO can be found in Appendix B. Appendix B specifies: We use a discount factor γ = 0.99. The update rate for the quantile is α = 1e-4. The update rate for the policy network parameter is β = 3e-4. The smoothing factor δ in Eqn. (14) is set to 0.1. We use a batch size of 2048. We train for 500 epochs, with 4 steps per epoch. We use the Adam optimizer with a learning rate of 3e-4 for both policy and value networks. The entropy coefficient is 0.01. The clipping ratio for PPO is 0.2. The GAE parameter λ is 0.97. The policy network is a 2-layer neural network with 64 units per layer and ReLU activation. The value network is also a 2-layer neural network with 64 units per layer and ReLU activation. The number of samples for estimating the empirical quantile is 10000. |
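The three coupled updates quoted in the Pseudocode row (Eqns. 17a–17c) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the multiplier step size `eta`, the use of a batch empirical quantile as the target q̂, the non-negativity projection on λ, and the scalar `grad_L` placeholder for the policy gradient are all assumptions not specified in the excerpt.

```python
import numpy as np

def tqpo_update(q, theta, lam, costs, grad_L, eps=0.10, d=15.0,
                alpha=1e-4, beta=3e-4, eta=1e-3):
    """One hypothetical iteration of the quantile/policy/multiplier updates.

    q      -- current quantile estimate (Eqn. 17a)
    theta  -- policy parameter, here a scalar stand-in (Eqn. 17b)
    lam    -- Lagrange multiplier for the quantile constraint (Eqn. 17c)
    costs  -- batch of episode costs used for the empirical quantile
    grad_L -- placeholder for the tilted-objective gradient w.r.t. theta
    """
    q_hat = np.quantile(costs, 1.0 - eps)    # empirical (1 - eps)-quantile, target of 17a
    q_new = q + alpha * (q_hat - q)          # (17a): track the cost quantile
    theta_new = theta + beta * grad_L        # (17b): gradient step on the policy
    lam_new = max(0.0, lam + eta * (q - d))  # (17c): dual ascent on q - d, projected to >= 0
    return q_new, theta_new, lam_new
```

With the paper's reported d = 15 and 1 − ε = 90%, the multiplier grows whenever the tracked cost quantile q exceeds the threshold, tightening the constraint, and decays otherwise.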