QPRL : Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal

Authors: Jumman Hossain, Nirmalya Roy

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, our experiments demonstrate that QPRL attains state-of-the-art performance across various navigation and control tasks, significantly reducing irreversible constraint violations by approximately 4× compared to baselines. Extensive Empirical Validation: We conduct comprehensive empirical evaluations across a range of challenging environments with asymmetric traversal costs, showing that QPRL significantly outperforms state-of-the-art methods in terms of sample efficiency, asymmetric cost handling, and overall performance.
Researcher Affiliation Academia Department of Information Systems, University of Maryland, Baltimore County, USA. Correspondence to: Jumman Hossain <EMAIL>.
Pseudocode Yes Algorithm 1 Quasi-Potential Reinforcement Learning (QPRL)
1: Input: Replay buffer D, learning rates αϕ, αψ, αθ, αω, threshold ϵ
2: for iteration = 1 to N do
3:   Sample batch {(si, ai, s′i, ci, gi)}ᴮᵢ₌₁ ∼ D
4:   Update Encoder & Transition Model:
5:     zi = fϕ(si), ẑ′i = Tψ(zi, ai)
6:     LT = (1/B) Σi ∥ẑ′i − fϕ(s′i)∥²
7:     Update ϕ, ψ using ∇ϕ,ψ LT
8:   Update Quasi-Potential Function:
9:     LU = (1/B) Σi (Φθ(gi) − Φθ(si) + Ψθ(si → gi) − ci)²
10:    Lconstraint = (1/B) Σi (Ψθ(si → s′i) − (ci − Φθ(s′i) + Φθ(si)))²
11:    Update θ using ∇θ (LU + λ Lconstraint)
12:  Update Policy with Safety Layer:
13:    zi = fϕ(si), ai = πω(si, gi)
14:    ẑ′i = Tψ(zi, ai)
15:    d̂i = Φθ(gi) − Φθ(si) + Ψθ(si → gi)
16:    Lπ = (1/B) Σi [d̂i + λ max(0, Φθ(ẑ′i) − Φθ(si) − ϵ)]
17:    Update ω using ∇ω Lπ
18: end for
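The loss terms of Algorithm 1 can be sketched in plain Python. Here `phi` and `psi` are hypothetical scalar stand-ins for the learned potential Φθ and the asymmetric residual Ψθ, and each batch element is a tuple (s, a, s′, c, g); this is an illustrative sketch, not the authors' implementation.

```python
# Sketch of the QPRL quasi-potential losses (steps 9-10 and 17 of Algorithm 1),
# with plain callables standing in for the learned networks.

def quasi_potential_distance(phi, psi, s, g):
    """d_hat = Phi(g) - Phi(s) + Psi(s -> g)."""
    return phi(g) - phi(s) + psi(s, g)

def potential_loss(phi, psi, batch):
    """L_U: squared error between the quasi-potential estimate and cost c."""
    return sum((quasi_potential_distance(phi, psi, s, g) - c) ** 2
               for (s, _a, _s2, c, g) in batch) / len(batch)

def constraint_loss(phi, psi, batch):
    """L_constraint: Psi(s -> s') should match c - Phi(s') + Phi(s)."""
    return sum((psi(s, s2) - (c - phi(s2) + phi(s))) ** 2
               for (s, _a, s2, c, _g) in batch) / len(batch)

# Example with a consistent choice: Phi(x) = 0, Psi(s, g) = g - s, unit costs.
phi = lambda x: 0.0
psi = lambda s, g: float(g - s)  # asymmetric: psi(s, g) != psi(g, s)
batch = [(0, None, 1, 1.0, 1), (2, None, 3, 1.0, 3)]
```

With this consistent choice both losses evaluate to zero, which is exactly the fixed point the joint objective ∇θ(LU + λLconstraint) drives toward.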
Open Source Code No Project Page: https://pralgomathic.github.io/qprl
Open Datasets Yes We evaluate QPRL in environments that exhibit significant asymmetries in traversal costs, emphasizing the need for optimal path planning:
Asymmetric Grid World: A 20x20 grid with direction-dependent traversal costs. Moving uphill incurs a cost of 2, while moving downhill costs 0.5. The agent navigates obstacles to reach the goal at the opposite corner.
Mountain Car (Modified): The classic Mountain Car problem is modified with asymmetric costs, where moving uphill incurs a penalty of -1 and moving downhill a penalty of -0.1.
Fetch Push (Asymmetric): A modified FetchPush-v1 environment where pushing objects uphill requires more energy than moving downhill, simulating real-world manipulation challenges.
LunarLander-v2 (Asymmetric): A continuous control task where upward thrust incurs higher fuel costs than lateral movement.
Maze2D Success Rate (%) (in Table 1). D4RL Maze2D Environments: In the offline settings, we follow a reward structure that combines distance-based rewards with asymmetric penalties for inefficient trajectories.
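The direction-dependent cost in the Asymmetric Grid World can be illustrated with a tiny helper. The elevation comparison and the treatment of flat moves (charged at the downhill rate) are assumptions for illustration; the paper only states the uphill/downhill costs of 2 and 0.5.

```python
# Illustrative direction-dependent traversal cost for the Asymmetric Grid
# World: uphill moves cost 2.0, downhill (and, by assumption, flat) moves
# cost 0.5. Note the asymmetry: cost(a -> b) != cost(b -> a) in general.

def traversal_cost(elev_from, elev_to, uphill=2.0, downhill=0.5):
    """Cost of moving between two cells, based on elevation change."""
    return uphill if elev_to > elev_from else downhill
```

This asymmetry is exactly what a symmetric distance metric cannot represent, and what the quasi-potential decomposition Φ + Ψ is designed to capture.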
Dataset Splits No The paper mentions that "Evaluation is performed every 10,000 timesteps using 100 test episodes" and that D4RL Maze2D Environments are used "In the offline settings". While this implies data is used for training and evaluation, it does not explicitly provide specific dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for the datasets, even if standard benchmark environments might have default splits. The paper itself does not define these splits for reproducibility.
Hardware Specification No The paper does not explicitly describe any specific hardware used for running its experiments, such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies No The paper does not provide specific version numbers for any software dependencies, libraries, or solvers used in the implementation of the methodology.
Experiment Setup Yes QPRL utilizes a neural network architecture for approximating the quasi-potential function Uθ. The architecture consists of the following components:
State Encoder (fϕ): The state encoder maps states from the original state space S into a latent representation in R^64. It is implemented as a feedforward neural network comprising two fully connected layers, each followed by ReLU activations. The first layer maps the input state to 128 hidden units, while the second layer projects it to the 64-dimensional latent space.
Quasimetric Head (Uψ): The quasimetric head computes the quasi-potential between latent state representations. Given two states in the latent space, the quasimetric head consists of fully connected layers that ensure a non-negative output by applying ReLU activations.
Optimization Algorithm: The model is optimized using the Adam optimizer, with learning rates αθ = 10⁻⁴ for the quasi-potential model and αλ = 10⁻³ for the Lagrange multiplier.
Training Schedule: Training is conducted for a total of 1 million timesteps. Evaluation is performed every 10,000 timesteps using 100 test episodes to observe the progress of learning.
Appendix G. Hyperparameter Sensitivity Analysis: We conducted a hyperparameter sensitivity analysis in the Asymmetric Grid World environment to understand the effect of various hyperparameters on the performance of Quasi-Potential Reinforcement Learning (QPRL). Specifically, we varied the learning rate, constraint threshold (ϵ), and batch size, measuring their impact on convergence speed and stability.
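The encoder described above (input → 128 → 64, each layer followed by ReLU) can be sketched in NumPy as follows. The weight initialization, random seed, and input state dimension are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

# Hedged sketch of the state encoder f_phi: two fully connected layers
# (state_dim -> 128 -> 64), each followed by a ReLU, so the latent
# representation lands in the non-negative orthant of R^64.

def relu(x):
    return np.maximum(x, 0.0)

class StateEncoder:
    def __init__(self, state_dim, hidden=128, latent=64, seed=0):
        rng = np.random.default_rng(seed)
        # Gaussian init is an assumption for illustration.
        self.w1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, latent))
        self.b2 = np.zeros(latent)

    def __call__(self, s):
        h = relu(s @ self.w1 + self.b1)       # first FC layer + ReLU
        return relu(h @ self.w2 + self.b2)    # second FC layer + ReLU
```

The trailing ReLU mirrors the paper's description of non-negative outputs, which the quasimetric head relies on when composing latent distances.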