Robust Average-Reward Reinforcement Learning

Authors: Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, Shaofeng Zou

JAIR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we numerically verify our theoretical results. We aim to verify two aspects of our methods: the convergence of the algorithms, and the robustness of them. Additional experiments can be found in Appendix A."
Researcher Affiliation | Collaboration | Yue Wang (EMAIL), University of Central Florida; Alvaro Velasquez (EMAIL), University of Colorado Boulder; George Atia (EMAIL), University of Central Florida; Ashley Prater-Bennette (EMAIL), Air Force Research Laboratory; Shaofeng Zou (EMAIL), University at Buffalo, The State University of New York
Pseudocode | Yes | Algorithm 1: Robust VI: Policy Evaluation; Algorithm 2: Robust VI: Optimal Control; Algorithm 3: Robust RVI; Algorithm 4: Robust RVI TD; Algorithm 5: Robust RVI Q-learning
Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the methodology described.
Open Datasets | Yes | "We first verify the convergence of our robust RVI TD and Q-learning algorithms under a Garnet problem G(30, 20) (Archibald et al., 1995)." "We first consider the recycling robot problem (Example 3.3 (Sutton & Barto, 2018))." "We further verify our robust RVI TD algorithm and robust RVI Q-learning under the Frozen Lake environment of Open AI (Brockman et al., 2016)."
Dataset Splits | No | The paper describes using problem environments such as the Garnet problem, the Recycling Robot, and Frozen Lake. While these environments define how data (experiences/trajectories) are generated during reinforcement learning, the paper does not specify fixed training/validation/test splits of pre-collected datasets in terms of percentages, sample counts, or an explicit splitting methodology. For example, it does not state how collected trajectories are divided for evaluation beyond the inherent process of RL training and policy evaluation.
Hardware Specification | No | The paper does not explicitly describe any specific hardware components (e.g., GPU/CPU models, memory, or accelerator types) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, programming languages, or environments used in the experiments. It mentions 'Open AI' in Appendix A.2, but not a version.
Experiment Setup | Yes | "We set the radius of the uncertainty set ζ = 0.4, α_n = 0.01, f(V) = Σ_s V(s)/|S| and f(Q) = Σ_{s,a} Q(s,a)/(|S||A|)." "We set ζ = 0.4 and implement our algorithms and vanilla Q-learning under the nominal environment (α = β = 0.5) with stepsize 0.01." "We first set ζ = 0.4 and α_t = 0.01, and implement our algorithms and vanilla Q-learning under the nominal environment where D_t ∼ Uniform(0, 16) is generated following the uniform distribution."
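The setup above (a Garnet problem G(30, 20), uncertainty radius ζ = 0.4, stepsize 0.01, and reference function f(Q) = Σ_{s,a} Q(s,a)/(|S||A|)) can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the Garnet generator, the worst-case mixing used to model the uncertainty set, and all names (`make_garnet`, `robust_rvi_q_update`) are assumptions for illustration only.

```python
import numpy as np

def make_garnet(n_states, n_actions, branching=5, seed=0):
    """Generate a random Garnet-style MDP G(n_states, n_actions):
    each (s, a) pair transitions to `branching` distinct successor
    states with random probabilities, and carries a random reward."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            P[s, a, succ] = rng.dirichlet(np.ones(branching))
    R = rng.random((n_states, n_actions))
    return P, R

def robust_rvi_q_update(Q, s, a, r, s_next, alpha=0.01, zeta=0.4):
    """One illustrative robust RVI Q-learning-style step: the next-state
    value mixes the sampled successor with the globally worst state
    (a simple stand-in for an uncertainty set of radius zeta), and the
    reference function f(Q) = mean of Q keeps the iterates bounded,
    as in relative value iteration for average-reward problems."""
    f_Q = Q.mean()                      # f(Q) = sum_{s,a} Q(s,a) / (|S||A|)
    v_next = Q[s_next].max()            # value of the sampled successor
    v_worst = Q.max(axis=1).min()       # value of the worst state
    target = r + (1 - zeta) * v_next + zeta * v_worst - f_Q
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Run on G(30, 20), matching the Garnet size used in the experiments.
P, R = make_garnet(30, 20)
Q = np.zeros((30, 20))
rng = np.random.default_rng(1)
s = 0
for _ in range(5000):
    a = int(rng.integers(20))
    s_next = int(rng.choice(30, p=P[s, a]))
    Q = robust_rvi_q_update(Q, s, a, R[s, a], s_next)
    s = s_next
```

The reference-function subtraction is what distinguishes the RVI family from plain Q-learning: without it, average-reward Q-values drift unboundedly, which is why the paper fixes f explicitly in its setup.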