Q-learning with Nearest Neighbors
Authors: Devavrat Shah, Qiaomin Xie
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | As the main contribution, we provide tight finite sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and the discount factor γ ∈ (0, 1), given an arbitrary sample path with covering time L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε³(1 − γ)⁷)) samples. ...Our analysis consists of viewing our algorithm as a special case of a general biased stochastic approximation procedure, for which we establish non-asymptotic convergence guarantees. |
| Researcher Affiliation | Academia | Devavrat Shah Massachusetts Institute of Technology EMAIL Qiaomin Xie Massachusetts Institute of Technology EMAIL Both authors are affiliated with Laboratory for Information and Decision Systems (LIDS). DS is with the Department of EECS as well as Statistics and Data Science Center at MIT. |
| Pseudocode | Yes | Policy 1: Nearest-Neighbor Q-learning. Input: exploration policy π, discount factor γ, number of steps T, bandwidth parameter h, and initial state Y₀. Construct discretized state space X_h; initialize t = k = 0, α₀ = 1, q₀ ≡ 0; for each (cᵢ, a) ∈ Z_h, set N₀(cᵢ, a) = 0. Repeat: draw action a_t ~ π(·\|Y_t) and observe reward R_t; generate the next state Y_{t+1} ~ p(·\|Y_t, a_t). For each i such that Y_t ∈ B_i: set η = 1/(N_k(cᵢ, a_t) + 1); if N_k(cᵢ, a_t) > 0 then (G_k q_k)(cᵢ, a_t) = (1 − η)(G_k q_k)(cᵢ, a_t) + η[R_t + γ max_{b∈A} (Γ_NN q_k)(Y_{t+1}, b)], else (G_k q_k)(cᵢ, a_t) = R_t + γ max_{b∈A} (Γ_NN q_k)(Y_{t+1}, b); set N_k(cᵢ, a_t) = N_k(cᵢ, a_t) + 1. If min_{(cᵢ,a)∈Z_h} N_k(cᵢ, a) > 0: for each (cᵢ, a) ∈ Z_h, q_{k+1}(cᵢ, a) = (1 − α_k)q_k(cᵢ, a) + α_k(G_k q_k)(cᵢ, a); k = k + 1; α_k = β/(β + k); reset N_k(cᵢ, a) = 0 for each (cᵢ, a) ∈ Z_h. Set t = t + 1; until t ≥ T. Return q̂ = q_k. |
| Open Source Code | No | The paper does not provide any information or links to open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and analyzes an algorithm's properties; it does not use or provide access information for any specific public dataset for training. |
| Dataset Splits | No | The paper is theoretical and does not report on empirical experiments with dataset splits. No specific dataset split information (training, validation, test) is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any empirical experiments, thus no specific hardware specifications for running experiments are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or an implementation that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on mathematical analysis of an algorithm rather than empirical evaluation. Therefore, it does not include details about an experimental setup, hyperparameters, or training configurations. |
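Since the paper reports no open-source implementation, the pseudocode above can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes a toy 1-D state space [0, 1], a uniform exploration policy, and hard assignment to the nearest grid center in place of the paper's general nearest-neighbor kernel of bandwidth h; the MDP (`step`, `reset`) and all parameter defaults are hypothetical.

```python
import numpy as np

def nearest_neighbor_q_learning(step, reset, n_actions, T=20000,
                                gamma=0.9, h=0.1, beta=1.0, seed=0):
    """Sketch of Nearest-Neighbor Q-learning (Policy 1) on a 1-D state
    space [0, 1].  Simplification: the nearest-neighbor operator is a
    hard assignment to the closest grid center, rather than a general
    kernel of bandwidth h as in the paper."""
    rng = np.random.default_rng(seed)
    centers = np.arange(h / 2, 1.0, h)           # discretized states X_h
    q = np.zeros((len(centers), n_actions))      # q_0 ≡ 0
    G = np.zeros_like(q)                         # running average (G_k q_k)
    N = np.zeros_like(q, dtype=int)              # per-epoch visit counts N_k
    k = 0                                        # epoch index
    y = reset(rng)                               # initial state Y_0
    for t in range(T):
        a = rng.integers(n_actions)              # uniform exploration policy π
        r, y_next = step(y, a, rng)              # observe R_t and Y_{t+1}
        i = int(np.argmin(np.abs(centers - y)))      # nearest center to Y_t
        j = int(np.argmin(np.abs(centers - y_next)))  # nearest center to Y_{t+1}
        target = r + gamma * q[j].max()          # R_t + γ max_b (Γ_NN q_k)(Y_{t+1}, b)
        eta = 1.0 / (N[i, a] + 1)
        # Incremental average of targets at (c_i, a_t) within the epoch.
        G[i, a] = (1 - eta) * G[i, a] + eta * target if N[i, a] > 0 else target
        N[i, a] += 1
        if N.min() > 0:                          # every (c_i, a) visited: close epoch
            alpha = beta / (beta + k)
            q = (1 - alpha) * q + alpha * G      # averaged update q_{k+1}
            k += 1
            N[:] = 0                             # reset visit counts
        y = y_next
    return centers, q
```

A usage sketch on an assumed random-walk MDP where the reward equals the current state, so moving right (action 1) should accrue larger Q-values:

```python
def step(y, a, rng):
    drift = 0.1 if a == 1 else -0.1
    y_next = min(1.0, max(0.0, y + drift + 0.02 * rng.standard_normal()))
    return y, y_next      # reward is the current state

def reset(rng):
    return rng.random()

centers, q = nearest_neighbor_q_learning(step, reset, n_actions=2)
```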