Nonstationary Reinforcement Learning with Linear Function Approximation
Authors: Huozhi Zhou, Jinglin Chen, Lav R. Varshney, Ashish Jagmohan
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide numerical experiments to demonstrate the effectiveness of our proposed algorithms. ... In this section, we perform empirical experiments on synthetic datasets to illustrate the effectiveness of LSVI-UCB-Restart and Ada-LSVI-UCB-Restart. We compare the cumulative rewards of the proposed algorithms with five baseline algorithms... |
| Researcher Affiliation | Collaboration | Huozhi Zhou (Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign); Jinglin Chen (Department of Computer Science, University of Illinois Urbana-Champaign); Lav R. Varshney (Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign); Ashish Jagmohan (IBM Research) |
| Pseudocode | Yes | Algorithm 1: LSVI-UCB-Restart; Algorithm 2: ADA-LSVI-UCB-Restart |
| Open Source Code | No | The paper does not provide explicit links to source code or statements of code release. It only mentions: "Both of these two concurrent works do not have empirical results, and we are also the first one to conduct numerical experiments on online exploration for non-stationary MDPs (Section 6)." |
| Open Datasets | No | In this section, we perform empirical experiments on synthetic datasets to illustrate the effectiveness of LSVI-UCB-Restart and Ada-LSVI-UCB-Restart. ... Appendix E.1 Synthetic Linear MDP Construction. |
| Dataset Splits | No | The paper describes the generation of synthetic datasets for online reinforcement learning in an episodic setting, which involves continuous interaction with the environment rather than predefined offline training/test/validation splits. No explicit dataset split information is provided. |
| Hardware Specification | Yes | All experiments are performed on a Macbook Pro with 8 cores, 16 GB of RAM. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | For LSVI-UCB and LSVI-UCB-Restart, we set β = 0.001·c·d·H·√(log(200dT)). In addition, for LSVI-UCB-Restart we test the performance of two cases: (1) known global variation, where we set W = B^{-1/2}T^{1/2}d^{1/2}H^{1/2}·H; (2) unknown global variation (denoted LSVI-UCB-Unknown), where we set W = T^{1/2}d^{1/2}H^{1/2}·H (the dynamic regret bound is O(B·d^{5/4}H^{5/4}T^{3/4}) for this case). For ADA-LSVI-UCB-Restart, we set the length of each block M = 0.2·T^{1/2}d^{1/2}H^{1/2}. Note that the tuning of hyperparameters differs from the theoretical derivations by constant factors. ... In Epsilon-Greedy, instead of adding a bonus term as in LSVI-UCB, the agent takes the greedy action according to the current estimate of the Q function with probability 1 − ε, and takes an action uniformly at random with probability ε, where we set ε = 0.05. |
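
The hyperparameter choices quoted in the Experiment Setup row can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the sign of the exponent on B, and the treatment of constants are assumptions made for readability; d, H, T, B, and c follow the paper's notation (feature dimension, horizon, total steps, variation budget, bonus constant).

```python
import math
import random

def lsvi_ucb_restart_hyperparams(d, H, T, B=None, c=1.0):
    """Hedged sketch of the quoted hyperparameter settings.

    beta scales the exploration bonus; W is the restart epoch length
    (the known-variation case uses the budget B, the unknown-variation
    case drops it); M is the block length for ADA-LSVI-UCB-Restart.
    The exponent on B is inferred from the stated T^{3/4} regret bound.
    """
    beta = 0.001 * c * d * H * math.sqrt(math.log(200 * d * T))
    if B is not None:                       # known global variation
        W = (B ** -0.5) * math.sqrt(T * d * H) * H
    else:                                   # unknown global variation
        W = math.sqrt(T * d * H) * H
    M = 0.2 * math.sqrt(T * d * H)          # ADA-LSVI-UCB-Restart block length
    return beta, W, M

def epsilon_greedy_action(q_values, epsilon=0.05, rng=random):
    """Epsilon-Greedy baseline: greedy w.p. 1 - epsilon, uniform otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

As a usage note, a larger variation budget B shortens the restart epoch W, so the agent discards stale least-squares estimates more often under faster drift.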