Q-learning with Nearest Neighbors
Authors: Devavrat Shah, Qiaomin Xie
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | As the main contribution, we provide tight finite sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and the discount factor γ ∈ (0, 1), given an arbitrary sample path with covering time L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε³(1 − γ)⁷)) samples. ...Our analysis consists of viewing our algorithm as a special case of a general biased stochastic approximation procedure, for which we establish non-asymptotic convergence guarantees. |
| Researcher Affiliation | Academia | Devavrat Shah Massachusetts Institute of Technology EMAIL Qiaomin Xie Massachusetts Institute of Technology EMAIL Both authors are affiliated with Laboratory for Information and Decision Systems (LIDS). DS is with the Department of EECS as well as Statistics and Data Science Center at MIT. |
| Pseudocode | Yes | Policy 1: Nearest-Neighbor Q-learning. Input: exploration policy π, discount factor γ, number of steps T, bandwidth parameter h, and initial state Y₀. Construct discretized state space X_h; initialize t = k = 0, α₀ = 1, q₀ ≡ 0; for each (cᵢ, a) ∈ Z_h, set N₀(cᵢ, a) = 0. Repeat: draw action a_t ~ π(·\|Y_t) and observe reward R_t; generate the next state Y_{t+1} ~ p(·\|Y_t, a_t). For each i such that Y_t ∈ B_i: set η = 1/(N_k(cᵢ, a_t) + 1); if N_k(cᵢ, a_t) > 0 then (G_k q_k)(cᵢ, a_t) = (1 − η)(G_k q_k)(cᵢ, a_t) + η[R_t + γ max_{b∈A} (Γ_NN q_k)(Y_{t+1}, b)], else (G_k q_k)(cᵢ, a_t) = R_t + γ max_{b∈A} (Γ_NN q_k)(Y_{t+1}, b); set N_k(cᵢ, a_t) = N_k(cᵢ, a_t) + 1. If min_{(cᵢ,a)∈Z_h} N_k(cᵢ, a) > 0: for each (cᵢ, a) ∈ Z_h, q_{k+1}(cᵢ, a) = (1 − α_k)q_k(cᵢ, a) + α_k(G_k q_k)(cᵢ, a); k = k + 1; α_k = β/(β + k); reset N_k(cᵢ, a) = 0 for each (cᵢ, a) ∈ Z_h. Set t = t + 1; until t ≥ T. Return q̂ = q_k. |
| Open Source Code | No | The paper does not provide any information or links to open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and analyzes an algorithm's properties; it does not use or provide access information for any specific public dataset for training. |
| Dataset Splits | No | The paper is theoretical and does not report on empirical experiments with dataset splits. No specific dataset split information (training, validation, test) is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any empirical experiments, thus no specific hardware specifications for running experiments are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or an implementation that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on mathematical analysis of an algorithm rather than empirical evaluation. Therefore, it does not include details about an experimental setup, hyperparameters, or training configurations. |
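Since the paper reports no open-source implementation, the pseudocode above can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes a toy 1-D state space [0, 1], a uniform exploration policy, and hard assignment to the nearest grid center in place of the paper's general nearest-neighbor kernel of bandwidth h; the MDP (`step`, `reset`) and all parameter defaults are hypothetical.

```python
import numpy as np

def nearest_neighbor_q_learning(step, reset, n_actions, T=20000,
                                gamma=0.9, h=0.1, beta=1.0, seed=0):
    """Sketch of Nearest-Neighbor Q-learning (Policy 1) on a 1-D state
    space [0, 1].  Simplification: the nearest-neighbor operator is a
    hard assignment to the closest grid center, rather than a general
    kernel of bandwidth h as in the paper."""
    rng = np.random.default_rng(seed)
    centers = np.arange(h / 2, 1.0, h)           # discretized states X_h
    q = np.zeros((len(centers), n_actions))      # q_0 ≡ 0
    G = np.zeros_like(q)                         # running average (G_k q_k)
    N = np.zeros_like(q, dtype=int)              # per-epoch visit counts N_k
    k = 0                                        # epoch index
    y = reset(rng)                               # initial state Y_0
    for t in range(T):
        a = rng.integers(n_actions)              # uniform exploration policy π
        r, y_next = step(y, a, rng)              # observe R_t and Y_{t+1}
        i = int(np.argmin(np.abs(centers - y)))      # nearest center to Y_t
        j = int(np.argmin(np.abs(centers - y_next)))  # nearest center to Y_{t+1}
        target = r + gamma * q[j].max()          # R_t + γ max_b (Γ_NN q_k)(Y_{t+1}, b)
        eta = 1.0 / (N[i, a] + 1)
        # Incremental average of targets at (c_i, a_t) within the epoch.
        G[i, a] = (1 - eta) * G[i, a] + eta * target if N[i, a] > 0 else target
        N[i, a] += 1
        if N.min() > 0:                          # every (c_i, a) visited: close epoch
            alpha = beta / (beta + k)
            q = (1 - alpha) * q + alpha * G      # averaged update q_{k+1}
            k += 1
            N[:] = 0                             # reset visit counts
        y = y_next
    return centers, q
```

A usage sketch on an assumed random-walk MDP where the reward equals the current state, so moving right (action 1) should accrue larger Q-values:

```python
def step(y, a, rng):
    drift = 0.1 if a == 1 else -0.1
    y_next = min(1.0, max(0.0, y + drift + 0.02 * rng.standard_normal()))
    return y, y_next      # reward is the current state

def reset(rng):
    return rng.random()

centers, q = nearest_neighbor_q_learning(step, reset, n_actions=2)
```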