Solving Robust MDPs through No-Regret Dynamics

Authors: Etash Kumar Guha

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We utilize our algorithmic framework to provide algorithms for gradient-dominated, smooth, and strongly gradient-dominated Robust MDPs. We also prove robust convergence rates with only a convexity assumption on the uncertainty set of environments. Alongside these guarantees, we run several small experiments on the convergence behavior of our algorithm, where the Value Function is the optimization objective in the Grid World MDP. Our small experiments corroborate this convergence rate in Section 7.
Researcher Affiliation | Collaboration | Etash Guha, etash.guha@sambanovasystems, SambaNova Systems and University of Washington
Pseudocode | Yes | Algorithm 1 (No-Regret RL): Data: T; for t ∈ [T] do... Algorithm 2 (A set of useful online learning algorithms): Input: η, O_α; for t = 1, ..., T do... Algorithm 3 (Projected Gradient Descent): Data: initial guess x_0, step size β, projection operator Proj_X; for t = 0 to T_O − 1 do...
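Algorithm 3 above is standard projected gradient descent. A minimal sketch of that update rule follows; the function names (`grad`, `proj`) and the example objective are illustrative choices, not taken from the paper:

```python
import numpy as np

def projected_gradient_descent(x0, grad, proj, beta, num_steps):
    """Iterate x_{t+1} = Proj_X(x_t - beta * grad(x_t)), as in Algorithm 3."""
    x = x0
    for _ in range(num_steps):
        x = proj(x - beta * grad(x))  # gradient step, then project back onto X
    return x

# Example: minimize f(x) = ||x||^2 over the box X = [1, 2]^2.
x_star = projected_gradient_descent(
    x0=np.array([2.0, 2.0]),
    grad=lambda x: 2 * x,                 # gradient of ||x||^2
    proj=lambda x: np.clip(x, 1.0, 2.0),  # Euclidean projection onto the box
    beta=0.1,
    num_steps=100,
)
# x_star approaches [1, 1], the projection of the unconstrained minimum onto X.
```

With a convex constraint set and a suitable step size, each iterate stays feasible, which is what makes this subroutine usable inside the robust (worst-case) optimization loop.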
Open Source Code | Yes | We have provided a GitHub repository to reproduce our experiments.
Open Datasets | Yes | We will use Algorithm 4 to optimize a policy in the Grid World MDP (Sutton & Barto, 2018).
Dataset Splits | No | The paper describes the Grid World MDP environment settings, such as 'grid is 5 states tall and 5 states wide' and 'The goal state is the opposite corner of the grid'. However, it does not explicitly provide training/validation/test splits, which are typical for datasets with predefined subsets rather than for an MDP environment.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments. It only mentions general software environments like 'OpenAI Gym'.
Software Dependencies | No | We use the OpenAI Gym environment to set up our experiments. For all optimization tasks, we utilize the constrained optimization libraries in SciPy. The paper mentions the software components OpenAI Gym and SciPy but does not specify their version numbers.
Experiment Setup | Yes | We experiment in the setting where the grid is 5 states tall and 5 states wide. The goal state is the opposite corner of the grid. At each step, the policy can take any of four actions. The next state is sampled from the transition matrix accordingly. If the policy lands in the goal state, it receives a reward of 10, and the MDP is finished. Otherwise, it receives a reward of 1. The sets in question will be T = {T s.t. ||T − T0||_q ≤ τ}. Here, T0 is some randomly generated initial transition matrix, q is a hyperparameter affecting the shape of the transition set, and τ is the radius of the transition set. We demonstrate robustness improvement over several values of τ and q. We plot the convergence of our algorithm over q ∈ {1, 2} and τ ∈ {0.1, 0.2, 0.3, 0.5} in Figure 1.
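The uncertainty set described in the setup is a q-norm ball of radius τ around a nominal transition matrix T0. A hedged sketch of a membership check for such a set follows; the grid size, the flattening of the matrix into a vector before taking the norm, and the specific T0, q, τ values are our illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

def in_uncertainty_set(T, T0, q, tau):
    """Check T ∈ {T : ||T - T0||_q <= tau}, treating the matrix as one vector."""
    return np.linalg.norm((T - T0).ravel(), ord=q) <= tau

rng = np.random.default_rng(0)
T0 = rng.random((4, 4))
T0 /= T0.sum(axis=1, keepdims=True)  # row-stochastic nominal transition matrix

perturbed = T0 + 0.01  # uniform perturbation of every entry

# 16 entries each perturbed by 0.01 give a 2-norm distance of 0.04 from T0,
# so membership depends on how tau compares to that distance.
inside = in_uncertainty_set(perturbed, T0, q=2, tau=0.1)    # True
outside = in_uncertainty_set(perturbed, T0, q=2, tau=0.01)  # False
```

Varying q changes the geometry of the ball (e.g. q = 1 vs. q = 2), which matches the paper's description of q as a hyperparameter affecting the shape of the transition set.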