Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States
Authors: Shi Dong, Benjamin Van Roy, Zhengyuan Zhou
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 5 plots cumulative moving average rewards attained by an optimistic Q-learning agent, which we will later present, averaged over two hundred independent simulations. |
| Researcher Affiliation | Academia | Shi Dong (EMAIL), Stanford University; Benjamin Van Roy (EMAIL), Stanford University; Zhengyuan Zhou (EMAIL), New York University |
| Pseudocode | Yes | Algorithm 1 (Discounted Q-Learning); Algorithm 2 (Growing-Horizon Q-Learning) |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to any code repositories. |
| Open Datasets | No | The paper uses a didactic example in a simulated environment (Service Rate Control) and specifies the environment dynamics in Appendix B. It does not use or provide access to any publicly available external datasets. |
| Dataset Splits | No | The paper mentions "averaged over two hundred independent simulations" for a didactic example but does not discuss standard dataset splits (training, validation, test) which are typically applied to pre-existing datasets. |
| Hardware Specification | No | The paper describes a theoretical framework and algorithm, with a simulated example. It does not provide any specific details about the hardware used to run these simulations. |
| Software Dependencies | No | The paper describes algorithms and their theoretical analysis, with a simulated example. It does not mention any specific software packages or their version numbers that would be necessary for reproduction. |
| Experiment Setup | Yes | To illustrate the importance of these schedules, let us revisit the service rate control example of Section 1.4. Simulation results reported in that section, which demonstrated the capability of optimistic Q-learning to improve performance over time, made use of particular smooth schedules: foo1(t) = 1.5·t^{1/5}, foo2(t) = 0.44·t^{3/10}·√(log t), foo3(t) = 1.5·(t^{1/5} − (t−1)^{1/5}), foo4(t) = 1. |
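The paper itself does not release code, so the following is only a minimal, hypothetical sketch of what an optimistic discounted Q-learning loop with the quoted smooth schedules might look like. The environment (`toy_step`), the use of `foo1` as an optimism-bonus scale, the 1/n step size, and all parameter values are assumptions for illustration; `foo2`–`foo4` are transcribed for completeness but their roles in the paper's agent are not specified here.

```python
import math
import random

# Schedules as quoted in the table above (names foo1..foo4 as extracted).
# Only foo1 is used below, as an optimism-bonus scale -- an assumption;
# foo2..foo4 are defined for completeness.
def foo1(t): return 1.5 * t ** (1 / 5)
def foo2(t): return 0.44 * t ** (3 / 10) * math.sqrt(math.log(max(t, 2)))
def foo3(t): return 1.5 * (t ** (1 / 5) - (t - 1) ** (1 / 5))
def foo4(t): return 1.0

def optimistic_q_learning(step_fn, n_states, n_actions, gamma=0.9,
                          T=20_000, seed=0):
    """Generic discounted Q-learning with a count-based optimism bonus.

    `step_fn(state, action, rng) -> (next_state, reward)` defines a toy
    environment; this is a stand-in, not the paper's agent-state setup.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    counts = [[0] * n_actions for _ in range(n_states)]
    s, total = 0, 0.0
    for t in range(1, T + 1):
        # Act greedily w.r.t. Q plus a bonus that shrinks with visit counts.
        a = max(range(n_actions),
                key=lambda a: Q[s][a] + foo1(t) / math.sqrt(counts[s][a] + 1))
        s2, r = step_fn(s, a, rng)
        counts[s][a] += 1
        alpha = 1.0 / counts[s][a]  # simple 1/n step size (assumption)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s, total = s2, total + r
    return Q, total / T

# Purely illustrative two-state, queue-like environment: action 1 "serves"
# (higher reward, resets to state 0); action 0 "idles" (small reward,
# queue may build up to state 1).
def toy_step(s, a, rng):
    if a == 1:
        return 0, 0.8
    return (1 if rng.random() < 0.5 else s), 0.1
```

Run as `Q, avg_reward = optimistic_q_learning(toy_step, n_states=2, n_actions=2)`; with these toy rewards the bonus-driven exploration tapers off and the greedy policy settles on serving, so the cumulative average reward approaches 0.8, qualitatively matching the kind of curve Figure 5 reports (averaged there over two hundred independent simulations).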