Solving Robust MDPs through No-Regret Dynamics

Authors: Etash Kumar Guha

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We utilize our algorithmic framework to provide algorithms for gradient-dominated, smooth, and strongly gradient-dominated Robust MDPs. We also prove robust convergence rates with only a convexity assumption on the uncertainty set of environments. Alongside these guarantees, we run several small experiments on the convergence behavior of our algorithm, where the Value Function is the optimization objective in the Grid World MDP. Our small experiments corroborate this convergence rate in Section 7.
Researcher Affiliation | Collaboration | Etash Guha, etash.guha@sambanovasystems, SambaNova Systems and University of Washington
Pseudocode | Yes | Algorithm 1 (No-Regret RL): Data: T; for t ∈ [T] do... Algorithm 2 (A set of useful online learning algorithms): Input: η, O_α; for t = 1, ..., T do... Algorithm 3 (Projected Gradient Descent): Data: initial guess x_0, step size β, projection operator Proj_X; for t = 0 to T_O − 1 do...
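Algorithm 3 above is standard projected gradient descent. A minimal sketch of that update rule follows; the function names (`grad`, `proj`) and the example objective are illustrative choices, not taken from the paper:

```python
import numpy as np

def projected_gradient_descent(x0, grad, proj, beta, num_steps):
    """Iterate x_{t+1} = Proj_X(x_t - beta * grad(x_t)), as in Algorithm 3."""
    x = x0
    for _ in range(num_steps):
        x = proj(x - beta * grad(x))  # gradient step, then project back onto X
    return x

# Example: minimize f(x) = ||x||^2 over the box X = [1, 2]^2.
x_star = projected_gradient_descent(
    x0=np.array([2.0, 2.0]),
    grad=lambda x: 2 * x,                 # gradient of ||x||^2
    proj=lambda x: np.clip(x, 1.0, 2.0),  # Euclidean projection onto the box
    beta=0.1,
    num_steps=100,
)
# x_star approaches [1, 1], the projection of the unconstrained minimum onto X.
```

With a convex constraint set and a suitable step size, each iterate stays feasible, which is what makes this subroutine usable inside the robust (worst-case) optimization loop.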
Open Source Code | Yes | We have provided a GitHub repository to reproduce our experiments.
Open Datasets | Yes | We will use Algorithm 4 to optimize a policy in the Grid World MDP (Sutton & Barto, 2018).
Dataset Splits | No | The paper describes the Grid World MDP environment settings, such as 'grid is 5 states tall and 5 states wide' and 'The goal state is the opposite corner of the grid'. However, it does not explicitly provide training/validation/test splits, which are typical for datasets with predefined subsets rather than for an MDP environment.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments. It only mentions general software environments like 'OpenAI Gym'.
Software Dependencies | No | We use the OpenAI Gym environment to set up our experiments. For all optimization tasks, we utilize the constrained optimization libraries in SciPy. The paper mentions the software components OpenAI Gym and SciPy but does not specify their version numbers.
Experiment Setup | Yes | We experiment in the setting where the grid is 5 states tall and 5 states wide. The goal state is the opposite corner of the grid. At each step, the policy can take any of four actions. The next state is sampled from the transition matrix accordingly. If the policy lands in the goal state, it receives a reward of 10, and the MDP is finished. Otherwise, it receives a reward of 1. The sets in question will be T = {T s.t. ||T − T0||_q ≤ τ}. Here, T0 is some randomly generated initial transition matrix, q is a hyperparameter affecting the shape of the transition set, and τ is the radius of the transition set. We demonstrate robustness improvement over several values of τ and q. We plot the convergence of our algorithm over q ∈ {1, 2} and τ ∈ {0.1, 0.2, 0.3, 0.5} in Figure 1.
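The uncertainty set described in the setup is a q-norm ball of radius τ around a nominal transition matrix T0. A hedged sketch of a membership check for such a set follows; the grid size, the flattening of the matrix into a vector before taking the norm, and the specific T0, q, τ values are our illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

def in_uncertainty_set(T, T0, q, tau):
    """Check T ∈ {T : ||T - T0||_q <= tau}, treating the matrix as one vector."""
    return np.linalg.norm((T - T0).ravel(), ord=q) <= tau

rng = np.random.default_rng(0)
T0 = rng.random((4, 4))
T0 /= T0.sum(axis=1, keepdims=True)  # row-stochastic nominal transition matrix

perturbed = T0 + 0.01  # uniform perturbation of every entry

# 16 entries each perturbed by 0.01 give a 2-norm distance of 0.04 from T0,
# so membership depends on how tau compares to that distance.
inside = in_uncertainty_set(perturbed, T0, q=2, tau=0.1)    # True
outside = in_uncertainty_set(perturbed, T0, q=2, tau=0.01)  # False
```

Varying q changes the geometry of the ball (e.g. q = 1 vs. q = 2), which matches the paper's description of q as a hyperparameter affecting the shape of the transition set.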