Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning
Authors: Chenglin Li, Guangchun Ruan, Hua Geng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the Experiments section, simulation results demonstrate that the proposed model fully guarantees safety (quantile constraints) while outperforming the state-of-the-art benchmarks with higher returns. We evaluate the proposed TQPO on three classic safe RL tasks: Simple Env, Dynamic Env and Gremlin Env from Mujoco and Safety Gym. |
| Researcher Affiliation | Academia | ¹Department of Automation, Tsinghua University, Beijing, China; ²Laboratory for Information & Decision Systems, MIT, Boston, USA. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Overall, the training process of TQPO iterates as follows: generate a batch of samples {s_0, s_1, ..., s_N} ~ π_θ; update the value network parameter ϕ; then update the three main parameters q, θ and λ as follows: q_{k+1} = q_k + α_k(q̂_{1−ε} − q_k) (17a); θ_{k+1} = θ_k + β_k ∇_θ L_θ (17b); λ_{k+1} = λ_k + η_k(q_k − d) (17c), where α_k, β_k, and η_k are the update rates of the three parameters respectively. |
| Open Source Code | Yes | Code is available at https://github.com/CharlieLeeeee/TQPO |
| Open Datasets | Yes | We evaluate the proposed TQPO on three classic safe RL tasks: Simple Env, Dynamic Env and Gremlin Env from Mujoco and Safety Gym (Todorov, Erez, and Tassa 2012; Ray, Achiam, and Amodei 2019). |
| Dataset Splits | No | The paper describes generating batches of samples for training (e.g., 'Generate a batch of samples {s0, s1, . . . , s N} πθ') and mentions 'an episode of 1000 steps' and running experiments with 'five random seeds'. However, it does not specify explicit training/test/validation dataset splits with percentages or counts, as typically understood for pre-existing datasets in supervised learning. The data for RL is generated dynamically through interaction with the environment rather than being split from a fixed dataset. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware components (e.g., GPU models, CPU types, memory specifications) used to run the experiments. It mentions the simulation environments (Mujoco and Safety Gym) but not the underlying computational hardware. |
| Software Dependencies | No | The paper mentions that 'The algorithm is based on the classic RL algorithm Proximal Policy Optimization (PPO) (Schulman et al. 2017)' and refers to 'rlpyt: A research code base for deep reinforcement learning in pytorch. arXiv preprint arXiv:1909.01500.' in the references. However, it does not specify version numbers for any software, libraries, or frameworks like PPO, PyTorch, or rlpyt. |
| Experiment Setup | Yes | The cost threshold is set to d = 15. Since many safety-critical applications require a high safety probability above 90%, 1 − ε = 90%, 95% are used in the experiments. All the experiments are conducted with five random seeds. Implementation details of TQPO can be found in Appendix B. Appendix B specifies: We use a discount factor γ = 0.99. The update rate for the quantile is α = 1e-4. The update rate for the policy network parameter is β = 3e-4. The smoothing factor δ in Eqn. (14) is set to 0.1. We use a batch size of 2048. We train for 500 epochs, with 4 steps per epoch. We use the Adam optimizer with a learning rate of 3e-4 for both policy and value networks. The entropy coefficient is 0.01. The clipping ratio for PPO is 0.2. The GAE parameter λ is 0.97. The policy network is a 2-layer neural network with 64 units per layer and ReLU activation. The value network is also a 2-layer neural network with 64 units per layer and ReLU activation. The number of samples for estimating the empirical quantile is 10000. |
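The three coupled updates quoted in the Pseudocode row (Eqns. 17a–17c) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the multiplier step size `eta`, the use of a batch empirical quantile as the target q̂, the non-negativity projection on λ, and the scalar `grad_L` placeholder for the policy gradient are all assumptions not specified in the excerpt.

```python
import numpy as np

def tqpo_update(q, theta, lam, costs, grad_L, eps=0.10, d=15.0,
                alpha=1e-4, beta=3e-4, eta=1e-3):
    """One hypothetical iteration of the quantile/policy/multiplier updates.

    q      -- current quantile estimate (Eqn. 17a)
    theta  -- policy parameter, here a scalar stand-in (Eqn. 17b)
    lam    -- Lagrange multiplier for the quantile constraint (Eqn. 17c)
    costs  -- batch of episode costs used for the empirical quantile
    grad_L -- placeholder for the tilted-objective gradient w.r.t. theta
    """
    q_hat = np.quantile(costs, 1.0 - eps)    # empirical (1 - eps)-quantile, target of 17a
    q_new = q + alpha * (q_hat - q)          # (17a): track the cost quantile
    theta_new = theta + beta * grad_L        # (17b): gradient step on the policy
    lam_new = max(0.0, lam + eta * (q - d))  # (17c): dual ascent on q - d, projected to >= 0
    return q_new, theta_new, lam_new
```

With the paper's reported d = 15 and 1 − ε = 90%, the multiplier grows whenever the tracked cost quantile q exceeds the threshold, tightening the constraint, and decays otherwise.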