Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity

Authors: Qiuhao Wang, Yuqi Zha, Chin Pang Ho, Marek Petrik

ICML 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Researcher Affiliation | Academia | 1 Fintech Innovation Center, Research Institute for Digital Economy and Interdisciplinary Sciences, Southwestern University of Finance and Economics; 2 Department of Data Science, City University of Hong Kong; 3 Department of Computer Science, University of New Hampshire. Correspondence to: Chin Pang Ho <EMAIL>, Marek Petrik <EMAIL>.
Pseudocode | Yes | Algorithm 1: Robust Projected Policy Gradient (RP2G); Algorithm 2: Projected gradient ascent for solving the worst-case transition kernel; Algorithm 3: Projected Langevin dynamics for solving the worst-case transition kernel.
Open Source Code | Yes | To support reproducibility, the full source code used to generate the results is available at https://github.com/Charliez7/robust-AMDP.
Open Datasets | Yes | We now demonstrate the convergence and robustness of RP2G, along with the two proposed inner solution methods, on the standard benchmark, GARNET MDPs (Archibald et al., 1995).
Dataset Splits | No | The paper describes generating different numbers of problem instances (e.g., 50 instances for convergence validation, 30 for runtime comparison, and 20 for general RAMDPs); these are distinct problem settings rather than a fixed dataset partitioned into standard train/validation/test splits.
Hardware Specification | Yes | All results were generated on an Apple M2 Max with 32 GB LPDDR5 memory.
Software Dependencies | Yes | The algorithms are implemented in Python 3.11.5, and we use Gurobi 11.0.3 to solve any linear optimization problems involved.
Experiment Setup | Yes | We run 50 sample instances with 250 iterations of RP2G for each GARNET problem. Specifically, we set the tolerance of the worst-case transition evaluation problem in RPMD to a fixed value δ = 10^-5, whereas RP2G uses a decreasing sequence initialized at δ_0 = 1 with a decay rate of τ = 0.95. We set the step size α = (1-γ)^2 for each robust discounted MDP, consistent with the theoretical convergence analysis in the reference. For RP2G, we use a step size of β = 0.05.
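The decreasing inner-problem tolerance described in the experiment setup (initial value δ_0 = 1, decay rate τ = 0.95, 250 outer iterations) can be sketched as a simple geometric schedule. This is a minimal illustration of the reported hyperparameters only; the function name and structure are assumptions, not the authors' implementation.

```python
def tolerance_schedule(delta0: float = 1.0, tau: float = 0.95, iters: int = 250):
    """Yield the inner-problem tolerance delta_k = delta0 * tau**k
    for each outer RP2G iteration k = 0, ..., iters - 1.
    (Illustrative sketch; hyperparameter values taken from the paper's setup.)
    """
    delta = delta0
    for _ in range(iters):
        yield delta
        delta *= tau  # geometric decay toward zero

tols = list(tolerance_schedule())
```

After 250 iterations the schedule has decayed to 0.95**249, which is below the fixed tolerance δ = 10^-5 that the RPMD baseline uses throughout.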