Robust Average-Reward Reinforcement Learning
Authors: Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, Shaofeng Zou
JAIR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we numerically verify our theoretical results. We aim to verify two aspects of our methods: the convergence of the algorithms and their robustness. Additional experiments can be found in Appendix A. |
| Researcher Affiliation | Collaboration | Yue Wang (University of Central Florida); Alvaro Velasquez (University of Colorado Boulder); George Atia (University of Central Florida); Ashley Prater-Bennette (Air Force Research Laboratory); Shaofeng Zou (University at Buffalo, The State University of New York) |
| Pseudocode | Yes | Algorithm 1 (Robust VI: Policy Evaluation); Algorithm 2 (Robust VI: Optimal Control); Algorithm 3 (Robust RVI); Algorithm 4 (Robust RVI TD); Algorithm 5 (Robust RVI Q-learning) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the methodology described. |
| Open Datasets | Yes | We first verify the convergence of our robust RVI TD and Q-learning algorithms under a Garnet problem G(30, 20) (Archibald et al., 1995). We then consider the recycling robot problem (Example 3.3, Sutton & Barto, 2018). We further verify our robust RVI TD algorithm and robust RVI Q-learning under the Frozen Lake environment of OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes using problem environments like the Garnet problem, Recycling Robot, and Frozen-Lake. While these environments define how data (experiences/trajectories) are generated during reinforcement learning, the paper does not specify fixed training/test/validation splits of pre-collected datasets in terms of percentages, sample counts, or explicit splitting methodologies. For example, it doesn't state how collected trajectories are divided for evaluation beyond the inherent process of RL training and policy evaluation. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware components (e.g., GPU/CPU models, memory, or accelerator types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, programming languages, or environments used in the experiments. It mentions 'Open AI' in Appendix A.2, but not a version. |
| Experiment Setup | Yes | We set the radius of the uncertainty set ζ = 0.4, α_n = 0.01, f(V) = Σ_s V(s)/|S| and f(Q) = Σ_{s,a} Q(s,a)/(|S||A|). We set ζ = 0.4 and implement our algorithms and vanilla Q-learning under the nominal environment (α = β = 0.5) with stepsize 0.01. We first set ζ = 0.4 and α_t = 0.01, and implement our algorithms and vanilla Q-learning under the nominal environment, where D_t ∼ Uniform(0, 16) is generated following the uniform distribution. |
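The setup row above combines three ingredients of the paper's experiments: a robust RVI Q-learning update, a reference function f(Q) equal to the mean of all Q-values, and an uncertainty set of radius ζ = 0.4 with stepsize 0.01. The sketch below illustrates how these pieces fit together on a made-up 2-state, 2-action MDP, assuming an R-contamination uncertainty set (worst case over (1−ζ)p + ζq, whose support function is (1−ζ)·E_p[V] + ζ·min_s V(s)); the transition kernel `P` and rewards `R` are hypothetical, and this is not the paper's exact implementation.

```python
import random

random.seed(0)

S, A = 2, 2
zeta, alpha = 0.4, 0.01          # uncertainty radius and stepsize from the setup row

# Hypothetical nominal kernel P[s][a] = next-state distribution, and rewards R[s][a]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.6, 0.4], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]

Q = [[0.0] * A for _ in range(S)]

def f(Q):
    # Reference function from the setup row: average of all Q-values
    return sum(sum(row) for row in Q) / (S * A)

for _ in range(50_000):
    s = random.randrange(S)
    a = random.randrange(A)
    s2 = random.choices(range(S), weights=P[s][a])[0]   # sample from nominal kernel
    V = [max(Q[t]) for t in range(S)]
    # Under an R-contamination set, a zeta-fraction of transition mass may be
    # moved adversarially, so the worst case lands on the lowest-value state.
    robust_target = (1 - zeta) * V[s2] + zeta * min(V)
    # Relative (RVI-style) update: subtract f(Q) to keep iterates bounded
    Q[s][a] += alpha * (R[s][a] + robust_target - f(Q) - Q[s][a])

print(f(Q))   # estimate of the robust average reward
```

As in the paper's algorithms, subtracting the reference function f(Q) in each update is what anchors the iterates, since average-reward relative value functions are only defined up to a constant shift.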