QPRL : Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal
Authors: Jumman Hossain, Nirmalya Roy
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our experiments demonstrate that QPRL attains state-of-the-art performance across various navigation and control tasks, reducing irreversible constraint violations by approximately 4× compared to baselines. Extensive Empirical Validation: We conduct comprehensive empirical evaluations across a range of challenging environments with asymmetric traversal costs, showing that QPRL significantly outperforms state-of-the-art methods in terms of sample efficiency, asymmetric cost handling, and overall performance. |
| Researcher Affiliation | Academia | 1Department of Information Systems, University of Maryland, Baltimore County, USA. Correspondence to: Jumman Hossain <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Quasi-Potential Reinforcement Learning (QPRL). 1: Input: Replay buffer D, learning rates α_ϕ, α_ψ, α_θ, α_ω, threshold ϵ. 2: for iteration = 1 to N do. 3: Sample batch {(s_i, a_i, s'_i, c_i, g_i)}_{i=1}^{B} ∼ D. 4: Update Encoder & Transition Model: 5: z_i = f_ϕ(s_i), ẑ'_i = T_ψ(z_i, a_i). 6: L_T = (1/B) Σ_i ‖ẑ'_i − f_ϕ(s'_i)‖². 7: Update ϕ, ψ using ∇_{ϕ,ψ} L_T. 8: Update Quasi-Potential Function: 9–10: L_U = (1/B) Σ_i [Φ_θ(g_i) − Φ_θ(s_i) + Ψ_θ(s_i → g_i) − c_i]². 11–12: L_constraint = (1/B) Σ_i [Ψ_θ(s_i → s'_i) − (c_i − Φ_θ(s'_i) + Φ_θ(s_i))]². 13: Update θ using ∇_θ(L_U + λ L_constraint). 14: Update Policy with Safety Layer: 15: z_i = f_ϕ(s_i), a_i = π_ω(s_i, g_i). 16: ẑ'_i = T_ψ(z_i, a_i). 17: d̂_i = Φ_θ(g_i) − Φ_θ(s_i) + Ψ_θ(s_i → g_i). 18–19: L_π = (1/B) Σ_i [d̂_i + λ max(0, Φ_θ(ẑ'_i) − Φ_θ(s_i) − ϵ)]. 20: Update ω using ∇_ω L_π. 21: end for |
| Open Source Code | No | Project Page: https://pralgomathic.github.io/qprl |
| Open Datasets | Yes | We evaluate QPRL in environments that exhibit significant asymmetries in traversal costs, emphasizing the need for optimal path planning: Asymmetric Grid World: A 20×20 grid with direction-dependent traversal costs. Moving uphill incurs a cost of 2, while moving downhill costs 0.5. The agent navigates obstacles to reach the goal at the opposite corner. Mountain Car (Modified): The classic Mountain Car problem is modified with asymmetric costs, where moving uphill incurs a penalty of -1 and downhill costs -0.1. Fetch Push (Asymmetric): A modified FetchPush-v1 environment where pushing objects uphill requires more energy than moving downhill, simulating real-world manipulation challenges. Lunar Lander-v2 (Asymmetric): A continuous control task where upward thrust incurs higher fuel costs than lateral movement. D4RL Maze2D Environments: In the offline settings, we follow a reward structure that combines distance-based rewards with asymmetric penalties for inefficient trajectories; Maze2D success rates (%) are reported in Table 1. |
| Dataset Splits | No | The paper mentions that "Evaluation is performed every 10,000 timesteps using 100 test episodes" and that D4RL Maze2D Environments are used "In the offline settings". While this implies data is used for training and evaluation, it does not explicitly provide specific dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for the datasets, even if standard benchmark environments might have default splits. The paper itself does not define these splits for reproducibility. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware used for running its experiments, such as GPU/CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or solvers used in the implementation of the methodology. |
| Experiment Setup | Yes | QPRL utilizes a neural network architecture for approximating the quasi-potential function U_θ. The architecture consists of the following components: State Encoder (f_ϕ): The state encoder maps states from the original state space S into a latent representation in ℝ⁶⁴. It is implemented as a feedforward neural network comprising two fully connected layers, each followed by ReLU activations. The first layer maps the input state to 128 hidden units, while the second layer projects it to the 64-dimensional latent space. Quasimetric Head (U_ψ): The quasimetric head computes the quasi-potential between latent state representations. Given two states in the latent space, the quasimetric head consists of fully connected layers that ensure a non-negative output by applying ReLU activations. Optimization Algorithm: The model is optimized using the Adam optimizer, with learning rates α_θ = 10⁻⁴ for the quasi-potential model and α_λ = 10⁻³ for the Lagrange multiplier. Training Schedule: Training is conducted for a total of 1 million timesteps. Evaluation is performed every 10,000 timesteps using 100 test episodes to observe the progress of learning. Appendix G. Hyperparameter Sensitivity Analysis: We conducted a hyperparameter sensitivity analysis in the Asymmetric Grid World environment to understand the effect of various hyperparameters on the performance of Quasi-Potential Reinforcement Learning (QPRL). Specifically, we varied the learning rate, constraint threshold (ϵ), and batch size, measuring their impact on convergence speed and stability. |
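For reference, the four batch losses in Algorithm 1 can be sketched as plain NumPy functions operating on precomputed network outputs. This is a minimal sketch, not the authors' released code: the network internals (f_ϕ, T_ψ, Φ_θ, Ψ_θ) are assumed to have already been evaluated on a batch, and every function name below is illustrative.

```python
import numpy as np

def transition_loss(z_next_pred, z_next_enc):
    """L_T = (1/B) Σ_i ‖ẑ'_i − f_ϕ(s'_i)‖² — latent transition-model error."""
    return np.mean(np.sum((z_next_pred - z_next_enc) ** 2, axis=-1))

def quasi_potential_loss(phi_g, phi_s, psi_sg, cost):
    """L_U = (1/B) Σ_i [Φ(g_i) − Φ(s_i) + Ψ(s_i→g_i) − c_i]²."""
    return np.mean((phi_g - phi_s + psi_sg - cost) ** 2)

def constraint_loss(psi_ss_next, cost, phi_s_next, phi_s):
    """L_constraint = (1/B) Σ_i [Ψ(s_i→s'_i) − (c_i − Φ(s'_i) + Φ(s_i))]²."""
    return np.mean((psi_ss_next - (cost - phi_s_next + phi_s)) ** 2)

def policy_loss(phi_g, phi_s, psi_sg, phi_z_next_pred, eps, lam):
    """L_π = (1/B) Σ_i [d̂_i + λ max(0, Φ(ẑ'_i) − Φ(s_i) − ϵ)].

    d̂_i is the estimated asymmetric distance-to-goal; the max(0, ·)
    hinge is the safety-layer penalty on predicted potential increase.
    """
    d_hat = phi_g - phi_s + psi_sg
    safety = np.maximum(0.0, phi_z_next_pred - phi_s - eps)
    return np.mean(d_hat + lam * safety)
```

In a full implementation these scalars would be minimized with an autodiff framework (the paper reports Adam), with gradients flowing into the respective parameter groups (ϕ, ψ), θ, and ω as in steps 7, 13, and 20 of Algorithm 1.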