Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes

Authors: David Mark Bossens, Atsushi Nitanda

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms. We evaluate MDPO-Robust-Lagrangian empirically by comparing it to robust-constrained variants of PPO-Lagrangian (Ray et al., 2019) and RMCPMD (Wang et al., 2024), and find significant improvements in the penalised return in worst-case and average-case test performance on dynamics in the uncertainty set. Section 6 is titled 'Experiments' and contains figures and tables presenting empirical results.
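The row above reports improvements in the "penalised return". A minimal sketch of one common form of this metric in constrained RL (total reward minus a multiplier-weighted constraint violation; the function name, the `cost_budget` parameter, and the exact penalty form are illustrative assumptions, not taken from the paper's code):

```python
def penalised_return(episode_rewards, episode_costs, lam, cost_budget):
    """Penalised return: total reward minus a multiplier-weighted
    constraint violation. A hedged sketch; the paper may aggregate
    rewards and costs differently (e.g. with discounting)."""
    total_reward = sum(episode_rewards)
    total_cost = sum(episode_costs)
    violation = max(0.0, total_cost - cost_budget)
    return total_reward - lam * violation

# Example: reward 10, cost 7 against a budget of 5, multiplier 2
# -> 10 - 2 * (7 - 5) = 6
print(penalised_return([4, 6], [3, 4], lam=2.0, cost_budget=5.0))
```

Under this form, an unconstrained comparison reduces to the plain return by setting `lam = 0`.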
Researcher Affiliation | Academia | David M. Bossens: Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore. Atsushi Nitanda: Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore. All listed institutions (A*STAR and Nanyang Technological University) are public research or academic institutions.
Pseudocode | Yes | Algorithm 1: Approximate TMA. Algorithm 2: Conservative Policy Iteration over transition kernels using Approximate TMA. Algorithm 3: Robust Sample-based PMD-PD (discrete setting). Algorithm 4: Robust Sample-based PMD-PD (continuous setting).
Open Source Code | Yes | The code for all experiments can be found at https://github.com/bossdm/MDPO-RCMDP.
Open Datasets | Yes | To assess our algorithm on an RCMDP, we introduce the robust constrained variant of the well-known Cartpole problem (Barto et al., 1983; Brockman et al., 2016) by modifying the RMDP from Wang et al. (2024). A second domain is the Inventory Management domain from Wang et al. (2024)... We use the same radial features and clipped uncertainty set as in the implementation on GitHub at https://github.com/JerrisonWang/JMLR-DRPMD. The third (and last) domain in the experiments introduces a multi-dimensional variant of the above Inventory Management problem.
Dataset Splits | No | The paper uses reinforcement learning environments (Cartpole, Inventory Management) rather than static datasets. It describes how test environments are generated with 'distortion levels' and 'perturbations', and mentions sample budgets for training duration, but there are no traditional training/validation/test splits of a static dataset.
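Since evaluation relies on perturbed dynamics rather than data splits, generating test environments at several distortion levels might look like the following sketch. The multiplicative-perturbation form, the function name, and the nominal parameter values are all illustrative assumptions; the paper's uncertainty set is defined over transition kernels, not raw physics parameters:

```python
def perturbed_params(nominal, distortion_levels):
    """Generate one perturbed dynamics-parameter set per distortion
    level by scaling each nominal parameter by (1 + delta).
    Illustrative only, not the paper's exact perturbation scheme."""
    return [
        {name: value * (1.0 + delta) for name, value in nominal.items()}
        for delta in distortion_levels
    ]

# Nominal Cartpole-style dynamics parameters (illustrative values)
nominal = {"pole_length": 0.5, "cart_mass": 1.0}
test_envs = perturbed_params(nominal, distortion_levels=[-0.1, 0.0, 0.1])
print(test_envs[0]["pole_length"])  # 0.5 * (1 - 0.1) = 0.45
```

Worst-case test performance is then the minimum of the evaluation metric over such a set of perturbed environments, and average-case performance is its mean.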
Hardware Specification | No | 'For MDPO and PPO experiments, we use four parallel environments with four CPUs.' This gives a generic CPU count but does not specify the model, clock speed, or any other detailed hardware specification of the computing environment.
Software Dependencies | No | All the algorithms are implemented in PyTorch. The code for RMCPMD is based on the original implementation... The code for PPO is taken from the Stable Baselines3 class. The paper names software such as PyTorch and Stable Baselines3 but provides no version numbers for these or other software components.
Experiment Setup | Yes | Hyperparameter settings (from Table 8 and other tables in Appendix E):
- Policy architecture: MLP with 4 inputs, Linear(128), Dropout(0.6), Linear(128), Softmax(2)
- Critic architecture: MLP with 4 inputs, Linear(128), Dropout(0.6), ReLU, Linear(1+m)
- Policy learning rate (η): 3e-4
- Policy optimiser: Adam
- GAE lambda (λ_GAE): 0.95
- Discount factor (γ): 0.99
- Batch and minibatch size: PPO/MDPO policy update with 400 * 4 time steps per batch, minibatch 32, at most 100 steps per episode
- Policy epochs: MCPMD 1; PPO 50 with early stopping at KL target 0.01; MDPO 5
- Transition kernel architecture: multivariate Gaussian with parametrised mean (1 + δ)µc(s), where δ ∈ (-0.005, 0.05, -0.005, 0.05), and covariance σI with σ = 1e-7
- LTMA learning rate (η_ξ): 1e-7
- Dual learning rate (η_λ): 1e-3
- Dual epochs: MCPMD 1; PPO 50 with early stopping at KL target 0.01; MDPO 5
- Multiplier: initialised to 5, linear update, clipped to λ_max = 50 for non-augmented algorithms
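The dual settings listed above (multiplier initialised to 5, dual learning rate 1e-3, clipping at λ_max = 50) fit a standard projected gradient-ascent update on the Lagrange multiplier. A minimal sketch under that assumption; the update rule and the constraint-violation estimator are not copied from the paper's code:

```python
def dual_update(lam, constraint_violation, lr=1e-3, lam_max=50.0):
    """One projected gradient-ascent step on a Lagrange multiplier:
    increase lam when the constraint is violated, decrease it when
    satisfied, and clip to [0, lam_max]. The values of lr and lam_max
    match the table; the rule itself is an assumed standard form."""
    lam = lam + lr * constraint_violation
    return min(max(lam, 0.0), lam_max)

lam = 5.0  # initial multiplier from the table
lam = dual_update(lam, constraint_violation=2.0)    # 5.0 + 1e-3 * 2 = 5.002
lam = dual_update(lam, constraint_violation=-10.0)  # decreases when satisfied
print(lam)
```

Clipping at λ_max bounds the penalty weight, which keeps the penalised objective from being dominated by the constraint term early in training.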