Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity

Authors: Qiuhao Wang, Yuqi Zha, Chin Pang Ho, Marek Petrik

ICML 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Researcher Affiliation | Academia | 1 Fintech Innovation Center, Research Institute for Digital Economy and Interdisciplinary Sciences, Southwestern University of Finance and Economics; 2 Department of Data Science, City University of Hong Kong; 3 Department of Computer Science, University of New Hampshire. Correspondence to: Chin Pang Ho <EMAIL>, Marek Petrik <EMAIL>.
Pseudocode | Yes | Algorithm 1: Robust Projected Policy Gradient (RP2G); Algorithm 2: Projected gradient ascent for solving the worst-case transition kernel; Algorithm 3: Projected Langevin dynamics for solving the worst-case transition kernel.
Open Source Code | Yes | To support reproducibility, the full source code used to generate the results is available at https://github.com/Charliez7/robust-AMDP.
Open Datasets | Yes | We now demonstrate the convergence and robustness of RP2G, along with the two proposed inner solution methods, on the standard benchmark, GARNET MDPs (Archibald et al., 1995).
Dataset Splits | No | The paper describes generating different numbers of problem instances (e.g., 50 instances for convergence validation, 30 for runtime comparison, and 20 for general RAMDPs); these are distinct problem settings rather than a fixed dataset partitioned into standard train/validation/test splits.
Hardware Specification | Yes | All results were generated on an Apple M2 Max with 32 GB LPDDR5 memory.
Software Dependencies | Yes | The algorithms are implemented in Python 3.11.5, and we use Gurobi 11.0.3 to solve any linear optimization problems involved.
Experiment Setup | Yes | We run 50 sample instances with 250 iterations of RP2G for each GARNET problem. Specifically, we set the tolerance of the worst-case transition evaluation problem in RPMD to a fixed value δ = 10^-5, whereas RP2G uses a decreasing sequence initialized at δ_0 = 1 with a decay rate of τ = 0.95. We set the step size α = (1-γ)^2 for each robust discounted MDP, consistent with the theoretical convergence analysis in the reference. For RP2G, we use a step size of β = 0.05.
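The decreasing inner-problem tolerance described in the experiment setup (initial value δ_0 = 1, decay rate τ = 0.95, 250 outer iterations) can be sketched as a simple geometric schedule. This is a minimal illustration of the reported hyperparameters only; the function name and structure are assumptions, not the authors' implementation.

```python
def tolerance_schedule(delta0: float = 1.0, tau: float = 0.95, iters: int = 250):
    """Yield the inner-problem tolerance delta_k = delta0 * tau**k
    for each outer RP2G iteration k = 0, ..., iters - 1.
    (Illustrative sketch; hyperparameter values taken from the paper's setup.)
    """
    delta = delta0
    for _ in range(iters):
        yield delta
        delta *= tau  # geometric decay toward zero

tols = list(tolerance_schedule())
```

After 250 iterations the schedule has decayed to 0.95**249, which is below the fixed tolerance δ = 10^-5 that the RPMD baseline uses throughout.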